Scientific Project: Databases for Multi-dimensional Data, Genomics and modern Hardware
David Broneske, Gabriel Campero, Bala Gurumurthy, Andreas Meister, Marcus Pinnecke, Roman Zoun, Gunter Saake
April 3, 2018

Overview

- Concepts of this course
- Course of action (milestones, presentations)
- Overview of project topics & forming project teams
- How to perform literature research?
- Further lectures:
  - Academic writing (2-3 lectures)

David Broneske et al. Scientific Project 2

Organization

Scientific Project: Modules

Bachelor
- Module: WPF FIN SMK (Schlüssel- und Methodenkompetenzen, i.e., key and methodological competencies)
- 5 CP = 150h ⇒ 42h presence time (3 SWS) + 108h autonomous work

Master
- Module: Scientific Team Project (Inf, IngInf, WIF, CV)
- DKE: Methods 2 or Applications
- DE: Interdisciplinary Team Project
- 6 CP = 180h ⇒ 42h presence time (3 SWS) + 138h autonomous work

Grade at the end of the course for the whole project team

Scientific Project: Prerequisite

- Successful programming test in C++/Java
- 1h theoretical test in a seminar room (date and place to be discussed)
- Half of the team members have to pass the test
- Topics:
  - Some language specifics
  - General program understanding
  - Control flow understanding
- You can take both tests and have to pass at least one!

Scientific Project: Semester Plan

              Monday    Tuesday   Wednesday  Thursday  Friday
Introduction  02.04.18  03.04.18  04.04.18   05.04.18  06.04.18
              09.04.18  10.04.18  11.04.18   12.04.18  13.04.18
              16.04.18  17.04.18  18.04.18   19.04.18  20.04.18
MS-I          23.04.18  24.04.18  25.04.18   26.04.18  27.04.18
              30.04.18  01.05.18  02.05.18   03.05.18  04.05.18
Aca-I         07.05.18  08.05.18  09.05.18   10.05.18  11.05.18
MS-II         14.05.18  15.05.18  16.05.18   17.05.18  18.05.18
Aca-II        21.05.18  22.05.18  23.05.18   24.05.18  25.05.18
              28.05.18  29.05.18  30.05.18   31.05.18  01.06.18
              04.06.18  05.06.18  06.06.18   07.06.18  08.06.18
MS-III        11.06.18  12.06.18  13.06.18   14.06.18  15.06.18
              18.06.18  19.06.18  20.06.18   21.06.18  22.06.18
              25.06.18  26.06.18  27.06.18   28.06.18  29.06.18
MS-Final      02.07.18  03.07.18  04.07.18   05.07.18  06.07.18

Scientific Project: Milestones

- Milestone I - Topic, schedule, and team presentation & first results of literature research
- Milestone II - Concept & additional literature research
- Milestone III - Implementation & evaluation setup
- Milestone IV - Final presentation (wrap-up + evaluation results)

Concepts & Content

Lecture, Meetings & Presentation

Lecture & Presentation
- Time/Place: Tuesday, 9:00-11:00, G29 - room K058
- Lectures with content of course → all
- Presentation of main milestones (see time table) → each project team

Meetings (Exercise)
- Individual for each project team
- Time and room to be agreed in project teams!
- Presentation of all intermediate results/milestones (informal)
- Discussion, discussion, discussion ...

Objectives & Qualification (I)

Acquired skills, specific to research
- Performing literature research
- Understanding and structured reviewing of scientific work
- Autonomous, solution-based reasoning on research task (e.g., finding alternative solutions)
- How to ask? How to adapt a task (extend/reduce)?
- Academic writing

Objectives & Qualification (II)

Acquired skills, always needed
- Team management
- Project and time scheduling
- Presentation of results
- Flexibility regarding changing conditions
- Reasoning about solutions ("Why is this the best/not adequate ...")

Progress of Course

Deliveries
- 4 milestone presentations (main milestones)
- Each team member has to present at least once
- Reporting of (sub) milestones in exercises/meetings
- Written paper about literature research (technical report)
- Prototypical implementation

Deliveries and Grading (I)

Technical Report
- Delivery of report at a given time (deadline)
- Quality/Quantity of literature research
- Number of pages
- Quality of paper structure and evaluation
- Own contribution

Deliveries and Grading (III)

Presentation & Discussion
- Quality of scientific presentation (structure, references, time)
- Assessment regarding the content (e.g., results of particular milestones)
- Participation in discussion

Organization
- Strictness
- Communication (just-in-time answers, satisfying time constraints)
- Self-organization (sharing tasks, internal reporting of current state-of-work, dealing with problems)
- Autonomous working

Deliveries and Grading (IV)

- Grade consists of:
  - Presentations: 30%
  - Implementation: 30%
  - Paper: 30%
  - Soft Skills: 10%
- Binding registration: Second Milestone
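As a worked example of the weighting above, the following minimal sketch combines four component grades into one team grade. The component keys, the function name `weighted_grade`, and the sample grades are illustrative assumptions; the slides do not state the grading scale or any rounding rule.

```python
# Weights as listed on the slide; the dictionary keys are hypothetical names.
WEIGHTS = {
    "presentations": 0.30,
    "implementation": 0.30,
    "paper": 0.30,
    "soft_skills": 0.10,
}


def weighted_grade(component_grades):
    """Return the weighted average of the four component grades.

    No rounding is applied, because the slides do not specify one.
    """
    missing = set(WEIGHTS) - set(component_grades)
    if missing:
        raise ValueError(f"missing component grades: {sorted(missing)}")
    return sum(WEIGHTS[name] * component_grades[name] for name in WEIGHTS)


# Hypothetical sample grades for one team (scale assumed, not from the slides).
example = {
    "presentations": 2.0,
    "implementation": 1.0,
    "paper": 2.0,
    "soft_skills": 1.0,
}
team_grade = weighted_grade(example)  # 0.3*2.0 + 0.3*1.0 + 0.3*2.0 + 0.1*1.0
```

With these sample values, the three 30% components contribute 0.6, 0.3, and 0.6, and soft skills contribute 0.1.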

Task & Time Management

Task Management
- Main milestones have to be finished in time
- (Sub) milestones are less strict (but don't be sloppy)
- Pre-defined work packages ⇒ each project team
  - ... defines sub work packages
  - ... determines responsibilities for these packages (divide & conquer)

Time Management
- Planning of periods
- Regarding capacities and resources
- Considering other tasks and activities
- Reporting of delays immediately to project members!

Role Management

- Possible roles: team leader, developer, researcher, ...
- Work together vs. responsibilities: design, implementation, testing, writing, ...
- Delegate for important roles/work packages
- Assignment of (sub) tasks to role for each milestone

Topic & Project Teams

- Teams with 3 to 5 students (depends on the task)
- Most tasks can be chosen once
- Projects
  - Theoretical part
    - State of the art
    - New ideas
  - Practical part
    - Usually in C++ or Java
    - Prototypical implementation
    - Evaluation part

Topic 1 - Join-Order Optimization

Intro
- Join-order optimization is needed for efficient query processing → NP-hard problem

Task
- What are common techniques? (Top-down approaches ...)
- Which optimizations are used within top-down approaches?
- Prototypical implementation using C++
- Tune algorithms for performance (e.g., using a profiler)
- Evaluate their performance and draw conclusions
- Compare to other algorithms
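To make the problem concrete, here is a minimal sketch of top-down join enumeration with memoization. It is written in Python for brevity, although the topic asks for a C++ prototype. The table names, cardinalities, selectivities, and the cost model (sum of intermediate result sizes) are all toy assumptions, not part of the course material. It also enumerates every split, including cross products, which practical top-down enumerators try to avoid.

```python
from functools import lru_cache
from itertools import combinations

# Hypothetical toy catalog: base-table cardinalities and join selectivities.
CARD = {"A": 1000, "B": 100, "C": 10}
# A pair without an entry has no join predicate (cross product, selectivity 1.0).
SEL = {frozenset("AB"): 0.01, frozenset("BC"): 0.1}


def cardinality(rels):
    """Estimated result size of joining all relations in `rels`."""
    size = 1.0
    for r in rels:
        size *= CARD[r]
    for a, b in combinations(sorted(rels), 2):
        size *= SEL.get(frozenset((a, b)), 1.0)
    return size


@lru_cache(maxsize=None)
def best_plan(rels):
    """Return (cost, plan) for a frozenset of relations.

    Top-down: split the set into two non-empty halves, solve both halves
    recursively, and memoize the result per subset.
    Assumed cost model: sum of all intermediate result sizes.
    """
    if len(rels) == 1:
        (r,) = rels
        return 0.0, r
    members = sorted(rels)
    first, rest = members[0], members[1:]
    best = None
    # Keep `first` on the left side so each unordered split is visited once.
    for k in range(len(rest)):
        for extra in combinations(rest, k):
            left = frozenset((first, *extra))
            right = rels - left
            left_cost, left_plan = best_plan(left)
            right_cost, right_plan = best_plan(right)
            cost = left_cost + right_cost + cardinality(rels)
            if best is None or cost < best[0]:
                best = (cost, (left_plan, right_plan))
    return best


cost, plan = best_plan(frozenset(CARD))
```

For this toy catalog, joining B and C first is cheapest because their intermediate result is the smallest of the three candidate pairs. The number of subsets still grows exponentially with the number of tables, which is why pruning and other optimizations are part of the task.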

Topic 2 - Applied GridFormation

Intro
- GridFormation is our developing concept for framing common physical design optimizations under a single reinforcement learning formulation. Currently, we use a Deep Q-Learning model.
- Collaborate with us in applying our solution to an existing DBMS.

Your Task
- Literature Research: Data Partitioning, Reinforcement Learning, Model Management.
- Prototypical implementation integrating our solution as an online solution for a DBMS.
- Experimental evaluation & novel contributions to our current design and implementation.

Topic 3 - Workload Forecasting w/RNNs

Intro
- DBMSs should be able to use reliable workload forecasts for optimizing their performance.
- Recurrent Neural Networks can produce such forecasts.
- What are the challenges and opportunities for integrating efficient RNN forecasting into a DBMS?

Your Task
- Literature research: Deep Learning for forecasting.
- Prototypical implementations and tuning of an RNN for forecasting database workloads.
- Evaluate the performance of the model (possibly incl. GPU vs. CPU), the quality of predictions, and how a prototypical storage engine could benefit from them.
- Suggest improvements and outline limitations.

Topic 4 - Join and Grouped Aggregation Using Granular Operations

Intro
- Joins and aggregations are compute-intensive
- They use similar algorithms for processing
- Common functionalities can be shared

We've got
- OpenCL framework
- Data-parallel processing primitives

Your Task
- Literature Research: Different join and grouped aggregation methods available
- Understanding data-parallel primitives and their granularities
- Concepts for using primitives for join and grouped aggregation processing
- Implementation of your concept of primitive-based execution
- Compare primitive execution with stand-alone algorithm

Topic 5 - If and While in GeckoScript

Intro
- Our graph database system GeckoDB comes with a powerful scriptable environment, the shell
- Modification and querying with Cypher-like query language + scripting w/ GeckoScript
- Conditional execution is missing: IF var THEN block ELSE block END
- Loops are missing: WHILE var block END

We've got
- GeckoDB incl. system shell (variable-/stack-based virtual machine)
- Working scripting language w/ (scoped) variables, functions, etc.
- Examples on function definitions, full access to sources (written in C)

Your Task
- Implement IF and WHILE statements for GeckoScript in C
- Evaluate shell incl. IF and WHILE on std. functions you have to write in GeckoScript (e.g., fibonacci) by comparing to 2+ languages of your choice (e.g., Python, JavaScript, Lua, ...)
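For the comparison part of the task, a baseline in one of the reference languages might look like the sketch below, which uses only a conditional and a loop, mirroring the IF and WHILE constructs to be added. The function names and the timing helper are assumptions for illustration. The GeckoScript counterpart is not shown because the slides give no syntax beyond the IF/WHILE forms.

```python
import time


def fib_iterative(n):
    """Return the n-th Fibonacci number (fib(0) = 0, fib(1) = 1)."""
    if n < 0:
        raise ValueError("n must be non-negative")
    a, b = 0, 1
    while n > 0:
        a, b = b, a + b
        n -= 1
    return a


def average_seconds(fn, arg, repeats=1000):
    """Average wall-clock time of fn(arg) over `repeats` calls."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn(arg)
    return (time.perf_counter() - start) / repeats


# Example measurement; the absolute number depends on the machine.
elapsed = average_seconds(fib_iterative, 30)
```

Running the same function with the same input and repeat count in each language keeps the comparison fair. Averaging over many calls reduces timer noise.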

Topic 6 - Thread pool in GeckoDB/BOLSTER

Intro
- Our graph database system GeckoDB is built from the ground up to support modern hardware, e.g., massive parallelism
- The heart of the parallelism support is BOLSTER, a library that we developed and that focuses on bulk-data operations
- Currently, threads are spawned on demand, which may lead to thousands of (short-lived) threads
- For 3-5 Bachelor/Master students

We've got
- GeckoDB incl. system, incl. an example where "the pain" is
- BOLSTER is ready to use and connected to the components, so everything is set up

Your Task
- Implement a thread pool (or similar strategy) in C for BOLSTER to avoid short-lived threads
- Evaluate a given example w/ and w/o your extension
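The idea behind a thread pool can be sketched as follows: a fixed set of worker threads pulls tasks from a shared queue, so no new thread is created per task. This is a language-neutral illustration in Python, not BOLSTER's API. The class `ThreadPool`, its methods, and `demo` are hypothetical names, and the actual task asks for an implementation in C.

```python
import queue
import threading


class ThreadPool:
    """Minimal fixed-size pool: workers are created once and reused."""

    def __init__(self, num_workers):
        self._tasks = queue.Queue()
        self._workers = [
            threading.Thread(target=self._run, daemon=True)
            for _ in range(num_workers)
        ]
        for worker in self._workers:
            worker.start()

    def _run(self):
        while True:
            task = self._tasks.get()
            if task is None:  # sentinel: stop this worker
                break
            fn, args = task
            try:
                # Note: a task that raises terminates its worker here;
                # a real pool would capture and report the error.
                fn(*args)
            finally:
                self._tasks.task_done()

    def submit(self, fn, *args):
        """Enqueue fn(*args) for execution by some worker."""
        self._tasks.put((fn, args))

    def shutdown(self):
        """Wait for all queued tasks, then stop and join the workers."""
        self._tasks.join()
        for _ in self._workers:
            self._tasks.put(None)
        for worker in self._workers:
            worker.join()


def demo(num_tasks=100, num_workers=4):
    """Run many short tasks and report how many threads executed them."""
    results, worker_ids = [], set()
    lock = threading.Lock()

    def task(i):
        with lock:
            results.append(i * i)
            worker_ids.add(threading.get_ident())

    pool = ThreadPool(num_workers)
    for i in range(num_tasks):
        pool.submit(task, i)
    pool.shutdown()
    return sorted(results), len(worker_ids)
```

In `demo`, one hundred short tasks are executed by at most four threads. Spawning on demand would instead create one thread per task, which is the overhead the topic aims to remove.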

Topic 7 - Cold Data Avoidance for Elf

Intro
- Cold data traversal for queries on a small number of columns
- Worst case: Mono-column selection predicates

We've got
- Elf implementation

Your Task
- Literature Research: Related index structures and cold data management
- Understanding of the Elf and its optimization concepts
- Implementation of Elfs for mono-column selection predicates, pointers into TID lists
- Performance evaluation of the variants
- Investigate ratio of storage overhead and performance gain

Topic 8 - Sort Queries in Elf

Intro
- Sorting is a data-intensive task
- Elf stores data already sorted → column order determines effectiveness

We've got
- Elf implementation

Your Task
- Literature Research: Sorting queries on partially indexed data
- Understanding of the Elf and its optimization concepts
- Implementation of additional sortings for Elfs
- Performance evaluation compared to a sort from scratch

Topic 9 - SIMD for Elf

Intro
- Elf nodes are similar to B-tree nodes
- Zeuch et al. introduced SIMD for B-tree

We've got
- Elf implementation

Your Task
- Literature Research: SIMD for tree structures
- Understanding of the Elf and its optimization concepts
- Implementation of SIMD Elf and its performance evaluation

Finding your Team

Topics:
- Topic 1 - Join-Order Optimization
- Topic 2 - Applied GridFormation
- Topic 3 - Workload Forecasting w/RNNs
- Topic 4 - Join and Grouped Aggregation Using Granular Operations
- Topic 5 - If and While in GeckoScript
- Topic 6 - Thread pool in GeckoDB/BOLSTER
- Topic 7 - Cold Data Avoidance for Elf
- Topic 8 - Sort Queries in Elf
- Topic 9 - SIMD for Elf

Literature Research

How to Perform Literature Research

- Efficient literature research requires
  - Knowledge of where to search
  - Knowledge of how to search
  - Finding adequate search terms
  - Structured review of papers
  - Knowledge of how to find information in papers

Where to Search (I)

- Different websites are available that provide large literature databases
  1. Google Scholar: http://scholar.google.de/
     - Keyword and concrete paper search
     - Often, PDFs are provided
  2. DBLP: http://www.informatik.uni-trier.de/~ley/db/
     - Search for keyword, conferences, journals, author(s)
     - BibTeX and references to other websites
  3. Citeseer: http://citeseerx.ist.psu.edu/about/site
     - Keyword, fulltext, author, and title search
     - BibTeX and (partially) PDFs are provided

Where to Search (II)

- Publisher sites are also a suitable target
  - ACM Digital Library: http://portal.acm.org/dl.cfm
    - Keyword, author, conference/literature (proceedings), and title search
    - BibTeX, mostly PDFs and other information are provided
  - IEEE Xplore: http://ieeexplore.ieee.org/Xplore/guesthome.jsp?reload=true
    - Similar to ACM, but only few PDFs
    - Extended access within university network
  - Springer: http://www.springerlink.de/
    - Similar to previous
    - Extended access within university network
- Further search possibilities: on author, research group or university sites

How to Search

Some hints to not get lost in the jungle
- Use distinct keywords (fingerprint vs. fingerprint data)
- Keep keywords simple (at most three words)
- Otherwise, search for whole title
- Read abstract (and maybe introduction) ⇒ decision for relevance

First insights
- Read abstract, introduction and background/related work (coarse-grained) to
  - ... get a first idea of the approach
  - ... find other relevant papers

Information Retrieval

Finding the required information
- Read the paper carefully
- Omit formal parts/sections
- Try to classify (core idea, main characteristics) ⇒ develop classification/evaluation in mind
- Understand the big picture
- Make notes
- Do NOT translate each sentence

Finding your Team

Topics:
- Topic 1 - Join-Order Optimization
- Topic 2 - Applied GridFormation
- Topic 3 - Workload Forecasting w/RNNs
- Topic 4 - Join and Grouped Aggregation Using Granular Operations
- Topic 5 - If and While in GeckoScript
- Topic 6 - Thread pool in GeckoDB/BOLSTER
- Topic 7 - Cold Data Avoidance for Elf
- Topic 8 - Sort Queries in Elf
- Topic 9 - SIMD for Elf

When do we meet for the programming test?
