Scientific Project: Databases for Multi-dimensional Data, Genomics and Modern Hardware
David Broneske, Gabriel Campero, Bala Gurumurthy, Andreas Meister, Marcus Pinnecke, Roman Zoun, Gunter Saake
April 3, 2018
Overview
- Concepts of this course
- Course of action (milestones, presentations)
- Overview of project topics & forming project teams
- How to perform literature research?
- Further lectures:
  - Academic writing (2-3 lectures)
David Broneske et al. Scientific Project 2
Organization
Scientific Project: Modules
Bachelor
- Module: WPF FIN SMK (Schlüssel- und Methodenkompetenzen)
- 5 CP = 150h ⇒ 42h presence time (3 SWS) + 108h autonomous work
Master
- Module: Scientific Team Project (Inf, IngInf, WIF, CV)
- DKE: Methods 2 or Applications
- DE: Interdisciplinary Team Project
- 6 CP = 180h ⇒ 42h presence time (3 SWS) + 138h autonomous work
Grade at the end of the course for the whole project team
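The hour figures follow from simple arithmetic. A minimal sketch, assuming 30 working hours per CP (implied by 5 CP = 150h) and 14 lecture weeks (an assumption derived from 42h / 3 SWS); the function name is illustrative only:

```python
HOURS_PER_CP = 30   # implied by the slide: 5 CP = 150h, 6 CP = 180h
SWS = 3             # weekly presence hours, from the slide
WEEKS = 14          # assumption: 42h presence time / 3 SWS


def workload(cp):
    """Return (total hours, presence hours, autonomous hours) for a module."""
    total = cp * HOURS_PER_CP
    presence = SWS * WEEKS
    return total, presence, total - presence


bachelor = workload(5)
master = workload(6)
print(bachelor, master)
```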
Scientific Project: Prerequisite
- Successful programming test in C++/Java
- 1h theoretical test in a seminar room (date and place to be discussed)
- Half of the team members have to pass the test
- Topics:
  - Some language specifics
  - General program understanding
  - Control flow understanding
- You can take both tests and have to pass at least one!
Scientific Project: Semester Plan
              Monday    Tuesday   Wednesday Thursday  Friday
Introduction  02.04.18  03.04.18  04.04.18  05.04.18  06.04.18
              09.04.18  10.04.18  11.04.18  12.04.18  13.04.18
              16.04.18  17.04.18  18.04.18  19.04.18  20.04.18
MS-I          23.04.18  24.04.18  25.04.18  26.04.18  27.04.18
              30.04.18  01.05.18  02.05.18  03.05.18  04.05.18
Aca-I         07.05.18  08.05.18  09.05.18  10.05.18  11.05.18
MS-II         14.05.18  15.05.18  16.05.18  17.05.18  18.05.18
Aca-II        21.05.18  22.05.18  23.05.18  24.05.18  25.05.18
              28.05.18  29.05.18  30.05.18  31.05.18  01.06.18
              04.06.18  05.06.18  06.06.18  07.06.18  08.06.18
MS-III        11.06.18  12.06.18  13.06.18  14.06.18  15.06.18
              18.06.18  19.06.18  20.06.18  21.06.18  22.06.18
              25.06.18  26.06.18  27.06.18  28.06.18  29.06.18
MS-Final      02.07.18  03.07.18  04.07.18  05.07.18  06.07.18
Scientific Project: Milestones
- Milestone I - Topic, schedule, and team presentation & first results of literature research
- Milestone II - Concept & additional literature research
- Milestone III - Implementation & evaluation setup
- Milestone IV - Final presentation (wrap-up + evaluation results)
Concepts & Content
Lecture, Meetings & Presentation
Lecture & Presentation
- Time/Place: Tuesday, 9:00-11:00, G29 - room K058
- Lectures with content of course → all
- Presentation of main milestones (see time table) → each project team
Meetings (Exercise)
- Individual for each project team
- Time and room to be agreed in project teams!
- Presentation of all intermediate results/milestones (informal)
- Discussion, discussion, discussion ...
Objectives & Qualification (I)
Acquired skills, specific to research
- Performing literature research
- Understanding and structured reviewing of scientific work
- Autonomous, solution-based reasoning on a research task (e.g., finding alternative solutions)
- How to ask? How to adapt a task (extend/reduce)?
- Academic writing
Objectives & Qualification (II)
Acquired skills, always needed
- Team management
- Project and time scheduling
- Presentation of results
- Flexibility regarding changing conditions
- Reasoning about solutions ("Why is this the best/not adequate ...")
Progress of Course
Deliveries
- 4 milestone presentations (main milestones)
- Each team member has to present at least once
- Reporting of (sub) milestones in exercises/meetings
- Written paper about literature research (technical report)
- Prototypical implementation
Deliveries and Grading (I)
Technical Report
- Delivery of report at a given time (deadline)
- Quality/Quantity of literature research
- Number of pages
- Quality of paper structure and evaluation
- Own contribution
Deliveries and Grading (III)
Presentation & Discussion
- Quality of scientific presentation (structure, references, time)
- Assessment regarding the content (e.g., results of particular milestones)
- Participation in discussion
Organization
- Strictness
- Communication (just-in-time answers, satisfying time constraints)
- Self-organization (sharing tasks, internal reporting of current state-of-work, dealing with problems)
- Autonomous working
Deliveries and Grading (IV)
- Grade consists of:
  - Presentations: 30%
  - Implementation: 30%
  - Paper: 30%
  - Soft Skills: 10%
- Binding registration: Second Milestone
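As an illustration of the weighting, a minimal sketch of a weighted average over the four components. The component grades, the dictionary keys, and the use of a 1.0-5.0 grade scale are assumptions for the example; how the final grade is rounded is not stated here.

```python
# Weights taken from the slide; key names are illustrative only.
WEIGHTS = {
    "presentations": 0.30,
    "implementation": 0.30,
    "paper": 0.30,
    "soft_skills": 0.10,
}


def team_grade(parts):
    """Weighted average of the component grades in `parts`."""
    missing = set(WEIGHTS) - set(parts)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * parts[name] for name in WEIGHTS)


# Hypothetical component grades of one team.
example = {"presentations": 1.7, "implementation": 2.0, "paper": 1.3, "soft_skills": 1.0}
grade = team_grade(example)
print(round(grade, 2))
```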
Task & Time Management
Task Management
- Main milestones have to be finished in time
- (Sub) milestones are less strict (but don't be sloppy)
- Pre-defined work packages ⇒ each project team
  - ... defines sub work packages
  - ... determines responsibilities for these packages (divide & conquer)
Time Management
- Planning of periods
  - Regarding capacities and resources
  - Considering other tasks and activities
- Reporting of delays immediately to project members!
Role Management
- Possible roles: team leader, developer, researcher, ...
- Work together vs. responsibilities: design, implementation, testing, writing, ...
- Delegate for important roles/work packages
- Assignment of (sub) tasks to role for each milestone
Topic & Project Teams
- Teams with 3 to 5 students (depends on the task)
- Most tasks can be chosen only once
- Projects
  - Theoretical part
    - State of the art
    - New ideas
  - Practical part
    - Usually in C++ or Java
    - Prototypical implementation
    - Evaluation part
Topic 1 - Join-Order Optimization
Intro
- Join-order optimization is needed for efficient query processing → NP-hard problem
Task
- What are common techniques? (Top-down approaches ...)
- Which optimizations are used within top-down approaches?
- Prototypical implementation using C++
- Tune algorithms for performance (e.g., using a profiler)
- Evaluate their performance and draw conclusions
- Compare to other algorithms
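To make the top-down idea concrete, here is a minimal sketch of naive top-down partitioning with memoization. It is not one of the published algorithms and not the C++ prototype asked for. The relation names, cardinalities, selectivities, and the cost model (sum of intermediate result sizes, cross products allowed) are all assumptions for illustration.

```python
from functools import lru_cache
from itertools import combinations

# Hypothetical example data: base cardinalities and join selectivities.
CARD = {"A": 1000, "B": 100, "C": 10}
SEL = {frozenset("AB"): 0.01, frozenset("BC"): 0.1}  # missing pair = cross product


def cardinality(rels):
    """Estimated result size of joining all relations in `rels`."""
    size = 1.0
    for r in rels:
        size *= CARD[r]
    for a, b in combinations(sorted(rels), 2):
        size *= SEL.get(frozenset((a, b)), 1.0)
    return size


@lru_cache(maxsize=None)
def best_plan(rels):
    """Return (cost, plan) for the frozenset `rels`, memoized per subset."""
    if len(rels) == 1:
        (r,) = rels
        return 0.0, r
    members = sorted(rels)
    first, rest = members[0], members[1:]
    best = None
    # Split `rels` into (left, right). Pinning `first` to the left side
    # avoids enumerating every symmetric pair twice.
    for k in range(len(rest)):
        for extra in combinations(rest, k):
            left = frozenset((first,) + extra)
            right = rels - left
            lcost, lplan = best_plan(left)
            rcost, rplan = best_plan(right)
            cost = lcost + rcost + cardinality(rels)
            if best is None or cost < best[0]:
                best = (cost, (lplan, rplan))
    return best


cost, plan = best_plan(frozenset(CARD))
print(cost, plan)
```

The memo table is what keeps the recursion from re-solving the same subset. Published top-down algorithms mainly differ in how they enumerate partitions (e.g., avoiding cross products) and how they prune.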
Topic 2 - Applied GridFormation
Intro
- GridFormation is our developing concept for framing common physical design optimizations under a single reinforcement learning formulation. Currently, we use a Deep Q-Learning model.
- Collaborate with us in applying our solution to an existing DBMS.
Your Task
- Literature Research: Data Partitioning, Reinforcement Learning, Model Management.
- Prototypical implementation integrating our solution as an online solution for a DBMS.
- Experimental evaluation & novel contributions to our current design and implementation.
Topic 3 - Workload Forecasting w/RNNs
Intro
- DBMSs should be able to use reliable workload forecasts for optimizing their performance.
- Recurrent Neural Networks can produce such forecasts.
- What are the challenges and opportunities for integrating efficient RNN forecasting into a DBMS?
Your Task
- Literature research: Deep Learning for forecasting.
- Prototypical implementations and tuning of an RNN for forecasting database workloads.
- Evaluate the performance of the model (possibly incl. GPU vs CPU), the quality of predictions and how a prototypical storage engine could benefit from them.
- Suggest improvements and outline limitations.
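Before any RNN is trained, the workload trace has to be framed as a supervised problem, and a simple baseline is needed to judge the quality of predictions. The sketch below shows only that preparation step, not an RNN. The trace values, the window size, and all function names are assumptions for illustration.

```python
def make_windows(series, window):
    """Turn a series into (input window, next value) training pairs."""
    return [(series[i:i + window], series[i + window])
            for i in range(len(series) - window)]


def mean_absolute_error(predictions, targets):
    """Average absolute deviation between predictions and targets."""
    return sum(abs(p - t) for p, t in zip(predictions, targets)) / len(targets)


# Hypothetical trace: number of queries observed per minute.
queries_per_minute = [12, 15, 11, 30, 28, 14, 13, 31, 29, 15]

pairs = make_windows(queries_per_minute, window=3)
targets = [target for _, target in pairs]

# Naive baseline: predict that the next value equals the last observed one.
baseline = [window[-1] for window, _ in pairs]
baseline_mae = mean_absolute_error(baseline, targets)
print(baseline_mae)
```

A forecasting model would be trained on the same pairs and should beat this baseline error to be worth integrating.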
Topic 4 - Join and Grouped Aggregation Using Granular Operations
Intro
- Join and aggregation are compute-intensive
- They use similar algorithms for processing
- Common functionalities can be shared
We’ve got
- OpenCL framework
- Data-parallel processing primitives
Your Task
- Literature Research: Different join and grouped aggregation methods available
- Understanding data-parallel primitives and their granularities
- Concepts for using primitives for join and grouped aggregation processing
- Implementation of your concept of primitive-based execution
- Compare primitive execution with stand-alone algorithm
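A minimal sequential sketch of how a grouped aggregation and a join could share one primitive. The primitive names `sort_by_key` and `reduce_by_key` are assumptions modelled on common data-parallel libraries. They are not the primitives of the OpenCL framework mentioned above, and the code is not parallel.

```python
def sort_by_key(keys, values):
    """Primitive: reorder both columns by ascending key (stable)."""
    order = sorted(range(len(keys)), key=keys.__getitem__)
    return [keys[i] for i in order], [values[i] for i in order]


def reduce_by_key(keys, values, op):
    """Primitive: combine runs of equal adjacent keys with `op`."""
    out_keys, out_vals = [], []
    for k, v in zip(keys, values):
        if out_keys and out_keys[-1] == k:
            out_vals[-1] = op(out_vals[-1], v)
        else:
            out_keys.append(k)
            out_vals.append(v)
    return out_keys, out_vals


def grouped_sum(keys, values):
    """Grouped aggregation = sort_by_key followed by reduce_by_key."""
    k, v = sort_by_key(keys, values)
    return reduce_by_key(k, v, lambda a, b: a + b)


def sort_merge_join(lkeys, lvals, rkeys, rvals):
    """Equi-join that reuses sort_by_key, then merges the sorted inputs."""
    lk, lv = sort_by_key(lkeys, lvals)
    rk, rv = sort_by_key(rkeys, rvals)
    out, i, j = [], 0, 0
    while i < len(lk) and j < len(rk):
        if lk[i] < rk[j]:
            i += 1
        elif lk[i] > rk[j]:
            j += 1
        else:
            # Emit one result per matching right tuple; j stays at the
            # start of the run so duplicate left keys match it again.
            j2 = j
            while j2 < len(rk) and rk[j2] == lk[i]:
                out.append((lk[i], lv[i], rv[j2]))
                j2 += 1
            i += 1
    return out


sums = grouped_sum([2, 1, 2, 3, 1], [10, 20, 30, 40, 50])
joined = sort_merge_join([1, 2, 2], ["a", "b", "c"], [2, 3, 2], ["x", "y", "z"])
print(sums, joined)
```

Both operators call the same sort primitive, which is the kind of shared functionality the intro refers to.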
Topic 5 - If and While in GeckoScript
Intro
- Our graph database system GeckoDB comes with a powerful scriptable environment, the shell
- Modification and querying with Cypher-like query language + scripting w/ GeckoScript
- Conditional execution is missing: IF var THEN block ELSE block END
- Loops are missing: WHILE var block END
We’ve got
- GeckoDB incl. system shell (variable-/stack-based virtual machine)
- Working scripting language w/ (scoped) variables, functions, etc.
- Examples on function definitions, full access to sources (written in C)
Your Task
- Implement IF and WHILE statements for GeckoScript in C
- Evaluate shell incl. IF and WHILE on std. functions you have to write in GeckoScript (e.g., Fibonacci) by comparing to 2+ languages of your choice (e.g., Python, JavaScript, Lua, ...)
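On a stack-based virtual machine, IF and WHILE are commonly lowered to conditional and unconditional jumps. The sketch below illustrates that general technique on a toy VM. The instruction set, the opcode names, and the rule that 0 means false are assumptions; none of it reflects GeckoDB's actual VM or GeckoScript's semantics.

```python
def run(program, variables=None):
    """Execute a list of (opcode, *args) tuples on a toy stack machine."""
    variables = dict(variables or {})
    stack, pc = [], 0
    while pc < len(program):
        op, *args = program[pc]
        pc += 1
        if op == "PUSH":
            stack.append(args[0])
        elif op == "LOAD":
            stack.append(variables[args[0]])
        elif op == "STORE":
            variables[args[0]] = stack.pop()
        elif op == "ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "SUB":
            b, a = stack.pop(), stack.pop()
            stack.append(a - b)
        elif op == "JMP":                 # unconditional jump
            pc = args[0]
        elif op == "JZ":                  # jump if popped value is zero
            if stack.pop() == 0:
                pc = args[0]
        else:
            raise ValueError(f"unknown opcode {op}")
    return variables


# IF x THEN r = 1 ELSE r = 2 END
IF_PROGRAM = [
    ("LOAD", "x"), ("JZ", 5),             # condition false: go to ELSE
    ("PUSH", 1), ("STORE", "r"),          # THEN block
    ("JMP", 7),                           # skip the ELSE block
    ("PUSH", 2), ("STORE", "r"),          # ELSE block
]

# WHILE n: a, b = b, a + b; n = n - 1 END   (iterative Fibonacci)
FIB_PROGRAM = [
    ("LOAD", "n"), ("JZ", 13),            # leave the loop when n == 0
    ("LOAD", "a"), ("LOAD", "b"), ("ADD",),
    ("LOAD", "b"), ("STORE", "a"),        # a = old b
    ("STORE", "b"),                       # b = old a + old b
    ("LOAD", "n"), ("PUSH", 1), ("SUB",), ("STORE", "n"),
    ("JMP", 0),                           # back to the loop condition
]

fib10 = run(FIB_PROGRAM, {"a": 0, "b": 1, "n": 10})["a"]
print(fib10)
```

A compiler for the two statements would emit the jump targets itself, typically by back-patching once the end of the block is known.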
Topic 6 - Thread pool in GeckoDB/BOLSTER
Intro
- Our graph database system GeckoDB is built from the ground up to support modern hardware, e.g., massive parallelism
- The heart of parallelism support is a library that we developed which focuses on bulk-data operations, BOLSTER
- Currently, threads are spawned on demand, which may lead to thousands of (short-lived) threads
- For 3-5 Bachelor/Master students
We’ve got
- GeckoDB incl. system incl. example where "the pain" is
- BOLSTER is ready to use and connected to the components, so everything is set up
Your Task
- Implement a thread pool (or similar strategy) in C for BOLSTER to avoid short-lived threads
- Evaluate a given example w/ and w/o your extension
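The thread-pool strategy in a nutshell: a fixed number of long-lived workers take tasks from a shared queue, so no thread is created per task. The sketch below only illustrates the idea. The class and method names are hypothetical, it is unrelated to BOLSTER's API, and the task itself asks for an implementation in C.

```python
import queue
import threading


class ThreadPool:
    """Fixed set of worker threads that pull tasks from a shared queue."""

    def __init__(self, num_workers):
        self._tasks = queue.Queue()
        self._workers = [threading.Thread(target=self._worker, daemon=True)
                         for _ in range(num_workers)]
        for worker in self._workers:
            worker.start()

    def _worker(self):
        while True:
            task = self._tasks.get()
            if task is None:              # sentinel: shut this worker down
                break
            func, args, result = task
            try:
                result["value"] = func(*args)
            except Exception as exc:      # keep the worker alive on errors
                result["error"] = exc
            finally:
                result["done"].set()

    def submit(self, func, *args):
        """Enqueue a task; returns a handle with a 'done' event."""
        result = {"done": threading.Event()}
        self._tasks.put((func, args, result))
        return result

    def shutdown(self):
        """Stop all workers after the queued tasks have been processed."""
        for _ in self._workers:
            self._tasks.put(None)
        for worker in self._workers:
            worker.join()


def square(x):
    # Also report which thread ran the task, to show thread reuse.
    return x * x, threading.get_ident()


pool = ThreadPool(4)
handles = [pool.submit(square, i) for i in range(1000)]
for handle in handles:
    handle["done"].wait()
pool.shutdown()

values = [h["value"][0] for h in handles]
thread_ids = {h["value"][1] for h in handles}
print(len(values), len(thread_ids))
```

Here 1000 tasks run on at most 4 threads. Spawning on demand would create one thread per task.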
Topic 7 - Cold Data Avoidance for Elf
Intro
- Cold data traversal for queries on a small number of columns
- Worst case: Mono-column selection predicates
We’ve got
- Elf implementation
Your Task
- Literature Research: Related index structures and cold data management
- Understanding of the Elf and its optimization concepts
- Implementation of Elfs for mono-column selection predicates, pointers into TID lists
- Performance evaluation of the variants
- Investigate ratio of storage overhead and performance gain
Topic 8 - Sort Queries in Elf
Intro
- Sorting is a data-intensive task
- Elf stores data already sorted → column order determines effectiveness
We’ve got
- Elf implementation
Your Task
- Literature Research: Sorting queries on partially indexed data
- Understanding of the Elf and its optimization concepts
- Implementation of additional sortings for Elfs
- Performance evaluation compared to a sort from scratch
Topic 9 - SIMD for Elf
Intro
- Elf nodes similar to B-tree nodes
- Zeuch et al. introduced SIMD for B-tree
We’ve got
- Elf implementation
Your Task
- Literature Research: SIMD for tree structures
- Understanding of the Elf and its optimization concepts
- Implementation of SIMD Elf and its performance evaluation
Finding your Team
Topics:
- Topic 1 - Join-Order Optimization
- Topic 2 - Applied GridFormation
- Topic 3 - Workload Forecasting w/RNNs
- Topic 4 - Join and Grouped Aggregation Using Granular Operations
- Topic 5 - If and While in GeckoScript
- Topic 6 - Thread pool in GeckoDB/BOLSTER
- Topic 7 - Cold Data Avoidance for Elf
- Topic 8 - Sort Queries in Elf
- Topic 9 - SIMD for Elf
Literature Research
How to Perform Literature Research
- Efficient literature research requires
  - Knowledge of where to search
  - Knowledge of how to search
  - Finding adequate search terms
  - Structured review of papers
  - Knowledge of how to find information in papers
Where to Search (I)
- Different websites are available that provide large literature databases
  1. Google Scholar: http://scholar.google.de/
     - Keyword and concrete paper search
     - Often, PDFs are provided
  2. DBLP: http://www.informatik.uni-trier.de/~ley/db/
     - Search for keyword, conferences, journals, author(s)
     - BibTeX and references to other websites
  3. Citeseer: http://citeseerx.ist.psu.edu/about/site
     - Keyword, full-text, author, and title search
     - BibTeX and (partially) PDFs are provided
Where to Search (II)
- Publisher sites are also a suitable target
  - ACM Digital Library: http://portal.acm.org/dl.cfm
    - Keyword, author, conference/literature (proceedings), and title search
    - BibTeX, mostly PDFs and other information are provided
  - IEEE Xplore: http://ieeexplore.ieee.org/Xplore/guesthome.jsp?reload=true
    - Similar to ACM, but only few PDFs
    - Extended access within university network
  - Springer: http://www.springerlink.de/
    - Similar to previous
    - Extended access within university network
- Further search possibilities: on author, research group or university sites
How to Search
Some hints to not get lost in the jungle
- Use distinct keywords (fingerprint vs. fingerprint data)
- Keep keywords simple (at most three words)
- Otherwise, search for whole title
- Read abstract (and maybe introduction) ⇒ decision for relevance
First insights
- Read abstract, introduction and background/related work (coarse-grained) to
  - ... get a first idea of the approach
  - ... find other relevant papers
Information Retrieval
Finding the required information
- Read the paper carefully
- Omit formal parts/sections
- Try to classify (core idea, main characteristics) ⇒ develop classification/evaluation in mind
- Understand the big picture
- Make notes
- Do NOT translate each sentence
Finding your Team
Topics:
- Topic 1 - Join-Order Optimization
- Topic 2 - Applied GridFormation
- Topic 3 - Workload Forecasting w/RNNs
- Topic 4 - Join and Grouped Aggregation Using Granular Operations
- Topic 5 - If and While in GeckoScript
- Topic 6 - Thread pool in GeckoDB/BOLSTER
- Topic 7 - Cold Data Avoidance for Elf
- Topic 8 - Sort Queries in Elf
- Topic 9 - SIMD for Elf
When do we meet for the programming test?