Big Data Lab Course (Praktikum)
Database Group (Abteilung Datenbanken)
Summer Semester 2017
Organization
Goal: design and implement an application or algorithm using existing big data frameworks
Schedule
The whole group must attend every assessment (Testat)
By early April: first meeting with the supervisor (request an appointment by email)
End of May: Testat 1: get to know the system / data import / solution sketch
Mid/end of July: Testat 2: present implementation and results
Early August: Testat 3: presentation
15 minutes per group
Attendance is mandatory for all lab course participants
Technical Details
Source code: GitHub repository; group members => collaborators
Repositories will be forked to https://github.com/leipzig-bigdata-lab after the lab course
Java: Apache Maven 3 for project management
Test-driven development is encouraged; see the documentation on unit tests in the respective frameworks
Source code documentation is mandatory!
Use stable versions (consult your supervisor if in doubt), e.g. Flink 1.1.2
Solutions that run locally can be executed on a dedicated cluster
Arrange a date in early July with [email protected]
Datasets: https://github.com/caesar0301/awesome-public-datasets
Is the globe really warming?
Yin-Chi Lin
Mr Trump's top advisers are currently divided on the issue (Paris climate agreement), with some, including Environmental Protection Agency head Scott Pruitt, eager for the US to leave the deal. "Paris is something that we need to really look at closely, because it's something we need to exit, in my opinion," Mr Pruitt said in an interview with Fox News Channel's "Fox & Friends" last week. "It's a bad deal for America. It was an America second, third or fourth kind of approach."
Is the globe warming?
• If yes, since when and at what magnitude?
• Are there regional differences (e.g. between different continents, countries, climate zones …)?
• Are there seasonal differences?
• Is the rise in temperature really correlated with the increase in CO2 emissions?
• “Europe’s Atlantic-facing countries will suffer heavier rainfalls, greater flood risk, more severe storm damage and an increase in “multiple climatic hazards…”
…….
IS THE GLOBE WARMING? Tuesday 18 April 2017
Data Sources & Tools
GHCN-Daily dataset (Global Historical Climatology Network):
• 1763-2017
• more than 100,000 stations across the globe
ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/
ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/
https://www1.ncdc.noaa.gov/pub/data/cdo/documentation/GHCND_documentation.pdf
Global CO2 Emissions from Fossil-Fuel Burning, Cement Manufacture, and Gas Flaring:
• 1751-2014
http://cdiac.ornl.gov/ftp/ndp030/global.1751_2014.ems
Tools: SparkR + map visualization tool
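As a starting point for the temperature questions, the following sketch averages daily maximum temperatures per year and correlates two yearly series. It is written in plain Python rather than SparkR and only illustrates the logic. It assumes the comma-separated by_year layout (station id, date as YYYYMMDD, element, value, measurement flag, quality flag, ...) with TMAX stored in tenths of °C; both assumptions should be checked against the GHCN-Daily documentation linked above. The sample rows, the station id, and the function names are made up for illustration.

```python
import csv
import io
from collections import defaultdict
from math import sqrt

# Made-up sample rows in the ASSUMED by_year layout:
# station id, date (YYYYMMDD), element, value, m-flag, q-flag, s-flag, obs-time
SAMPLE = """\
STATION0001,20150101,TMAX,100,,,S,
STATION0001,20150701,TMAX,300,,,S,
STATION0001,20150701,PRCP,12,,,S,
STATION0001,20160101,TMAX,120,,,S,
STATION0001,20160701,TMAX,320,,,S,
STATION0001,20160702,TMAX,9999,,X,S,
"""


def yearly_mean_tmax(lines):
    """Return {year: mean TMAX in degrees C}, skipping quality-flagged rows."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in csv.reader(lines):
        if len(row) < 6:
            continue  # ignore blank or truncated lines
        _station, date, element, value, _mflag, qflag = row[:6]
        if element != "TMAX" or qflag:
            continue  # keep only TMAX observations that passed quality checks
        year = int(date[:4])
        sums[year] += int(value) / 10.0  # assumed unit: tenths of degrees C
        counts[year] += 1
    return {year: sums[year] / counts[year] for year in sorted(sums)}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


if __name__ == "__main__":
    print(yearly_mean_tmax(io.StringIO(SAMPLE)))
```

To address the CO2 question, `pearson` could be applied to the yearly temperature means and the yearly emission totals for the years both datasets share. A high coefficient alone would not establish causation.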
Analysing Metabolic Networks in Gradoop
Anika Groß
• Modeling of metabolic networks and biochemical reactions in the EPGM (Extended Property Graph Model)
• Transformation and import into Gradoop
• Data from http://bigg.ucsd.edu
Figure sources: [1], [2], http://bigg.ucsd.edu
[1] Lanzeni, Messina, Archetti: Graph models and mathematical programming in biochemical network analysis and metabolic engineering design. Computers & Mathematics with Applications, 2008.
[2] Junghanns, Petermann: Verteilte Graphanalyse mit Gradoop. JavaSPEKTRUM 05/2016.
• Data analysis: "hub" molecules, search for patterns, finding frequent subgraphs, …
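The "hub" molecule analysis can be illustrated without Gradoop. The sketch below treats reactions and molecules as a bipartite graph and counts in how many reactions each molecule takes part. The three simplified glycolysis steps, the reaction names R1 to R3, and the threshold are illustrative assumptions; a Gradoop solution would express the same idea with its graph operators.

```python
from collections import Counter

# Toy input (simplified glycolysis steps, for illustration only):
# reaction name -> (substrates, products)
REACTIONS = {
    "R1": (["glucose", "ATP"], ["G6P", "ADP"]),
    "R2": (["G6P"], ["F6P"]),
    "R3": (["F6P", "ATP"], ["FBP", "ADP"]),
}


def molecule_degrees(reactions):
    """Count the number of reactions each molecule participates in."""
    degree = Counter()
    for substrates, products in reactions.values():
        # A molecule counts once per reaction, even if it is on both sides.
        for molecule in set(substrates) | set(products):
            degree[molecule] += 1
    return degree


def hubs(reactions, min_degree):
    """Return molecules taking part in at least min_degree reactions."""
    degrees = molecule_degrees(reactions)
    return sorted(m for m, d in degrees.items() if d >= min_degree)


if __name__ == "__main__":
    print(hubs(REACTIONS, min_degree=2))
```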
Analyzing Panama Papers with Gradoop
Eric Peukert
• Loading Panama Papers with Gradoop (Neo4J-Connector or from CSV)
• Visualize schema
• Implement analytical workflows
• Optional: link with additional sources in Germany such as people/companies from DBpedia
Analytics of Development Project Data
Eric Peukert
https://issues.apache.org/jira/rest/api/2/project
Analytical Workflows
Analysis of LOD datasets within Gradoop
Markus Nentwig
Linked Open Data
• Structured, interlinked data using standard technologies
• HTTP: dereference entities
• RDF: machine-readable data exchange format
• URIs: identification of entities
Gradoop: Distributed Graph Analytics
- Use graph operators like aggregation, grouping or subgraph to analyse data
- On top of Apache Flink
- Extended Property Graph Model
- Support for different data sources like CSV, JSON, …
- Currently missing: RDF
Tasks:
Figure: Data Source -> Gradoop Analysis -> Data Sink
- Implement data source and data sink for RDF data format
  - Based on existing data sources
  - Import/export LOD data set
  - Handle RDF reification
- Analyze a given dataset with simple Gradoop operators
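One possible mapping from RDF triples to a property graph is sketched below: a triple with a literal object becomes a vertex property, and a triple with a URI object becomes a labeled edge. This mapping is an assumption for illustration, not Gradoop's API. The `Literal` class, the function name, and the example.org URIs are hypothetical. Reification and multi-valued properties are deliberately not handled (a repeated predicate overwrites the earlier value).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Literal:
    """Hypothetical wrapper marking an RDF literal (as opposed to a URI)."""
    value: object


def triples_to_property_graph(triples):
    """Map (subject, predicate, object) triples to vertices and edges.

    Returns (vertices, edges) where vertices maps a URI to its property
    dict and edges is a list of (source, label, target) tuples.
    """
    vertices = {}
    edges = []
    for subject, predicate, obj in triples:
        properties = vertices.setdefault(subject, {})
        if isinstance(obj, Literal):
            # Literal object: store as a property of the subject vertex.
            properties[predicate] = obj.value
        else:
            # URI object: create the target vertex and a labeled edge.
            vertices.setdefault(obj, {})
            edges.append((subject, predicate, obj))
    return vertices, edges


EX = "http://example.org/"
TRIPLES = [
    (EX + "alice", EX + "name", Literal("Alice")),
    (EX + "alice", EX + "knows", EX + "bob"),
    (EX + "bob", EX + "name", Literal("Bob")),
]

if __name__ == "__main__":
    print(triples_to_property_graph(TRIPLES))
```

The data sink would apply the inverse mapping: each property becomes a literal triple and each edge a URI triple.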
Analytics of Publication Data with Graphulo
Matthias Kricke
Technologies
• Graphulo, which is based on
  • Apache Accumulo (Distributed DBS)
  • Apache Hadoop HDFS (Distributed Filesystem)
Data
• DBLP
  • open bibliographic information on major computer science journals and proceedings
Task
• Import DBLP into Graphulo
• Analyze DBLP by means of Graphulo
  • Graph diameter?
  • Size of biggest connected component?
  • …
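The two example questions can be made concrete on a small in-memory graph. The sketch below uses breadth-first search to find connected components and the diameter (longest shortest path) of a component. The toy co-author edges and all function names are made up for illustration. Graphulo would express such computations differently, and an all-pairs BFS like this is only practical for small graphs.

```python
from collections import defaultdict, deque


def build_graph(edges):
    """Build an undirected adjacency map from (u, v) pairs."""
    graph = defaultdict(set)
    for u, v in edges:
        graph[u].add(v)
        graph[v].add(u)
    return graph


def bfs_distances(graph, start):
    """Return hop distances from start to every reachable vertex."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def connected_components(graph):
    """Return a list of vertex sets, one per connected component."""
    seen = set()
    components = []
    for node in list(graph):
        if node not in seen:
            component = set(bfs_distances(graph, node))
            seen |= component
            components.append(component)
    return components


def diameter(graph, component):
    """Longest shortest path within one component (BFS from every vertex)."""
    return max(max(bfs_distances(graph, v).values()) for v in component)


# Toy co-author edges (hypothetical authors a..f)
EDGES = [("a", "b"), ("b", "c"), ("c", "d"), ("e", "f")]

if __name__ == "__main__":
    g = build_graph(EDGES)
    biggest = max(connected_components(g), key=len)
    print(len(biggest), diameter(g, biggest))
```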
Cluster Environment
Classification of program traces using TensorFlow or Caffe
Martin Grimmer
Classification of program traces using TensorFlow
• A program trace is the sequence of system calls of a program.
54 175 120 175 175 3 175 175 120 175 120 175 120 175 175 120 175 3 3 3 175 120 175 175 175 7 3 3 175 120 175 7 175 7 119 174 54 3 3 175 175 3 120 175 175 120 175 120 120 175 175 54 140 3 175 120 175 175 175 175 175 174 7 175 7 119 3 3 175 3 175 175 120 175 7 175 3 175 120 175 175 54 7 174 3 175 120 7 175 175 120 175 175 3 175 120 175 3 3 120 175 120 175 175 7 54 175 120 175 7 175 7 119 174 54 3 120 175 175 120 54 3 120 175 175 54 140 175 175 174 54 175 120 175 175 54 140
• TensorFlow is an open-source library for artificial intelligence.
• https://www.tensorflow.org/
• The task: Build a classifier with TensorFlow that learns what is normal.
-> One-class classification problem!
Use this classifier to test unknown system traces for abnormal behavior.
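To make the one-class setting concrete, here is a deliberately simple baseline that does not use TensorFlow: it memorizes all sliding windows of system calls seen in normal traces and scores an unknown trace by the fraction of its windows that were never seen. The window length, the function names, and the modified traces are illustrative assumptions. The normal trace is the beginning of the example trace above. A learned model would replace the lookup set but could be evaluated the same way.

```python
def windows(trace, n):
    """Return the set of all length-n windows (as tuples) of a trace."""
    return {tuple(trace[i:i + n]) for i in range(len(trace) - n + 1)}


def fit_normal_profile(normal_traces, n):
    """Collect every window that occurs in the normal training traces."""
    profile = set()
    for trace in normal_traces:
        profile |= windows(trace, n)
    return profile


def anomaly_score(trace, profile, n):
    """Fraction of the trace's distinct windows that are unknown (0.0 to 1.0)."""
    seen = windows(trace, n)
    if not seen:
        return 0.0  # trace shorter than the window: nothing to judge
    return len(seen - profile) / len(seen)


# First ten system calls of the example trace on this slide
NORMAL = [54, 175, 120, 175, 175, 3, 175, 175, 120, 175]

if __name__ == "__main__":
    profile = fit_normal_profile([NORMAL], n=3)
    print(anomaly_score(NORMAL, profile, n=3))
    # 999 is a made-up system call number that never occurs in NORMAL
    print(anomaly_score([54, 175, 120, 999], profile, n=3))
```

Only normal traces are needed for training, which is what makes this a one-class problem; a threshold on the score then separates normal from abnormal behavior.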
Speed up Entity Resolution with Bit Arrays
Ziad Sehili
Entity Resolution
Find records in different databases that refer to the same real-world object
Build tokens:
trigrams(tommas schmidt) = {tom, omm, mma, mas, sch, chm, hmi, mid, idt}
trigrams(tomas schmidt) = {tom, oma, mas, sch, chm, hmi, mid, idt}
Jaccard similarity = #intersection_tokens / #union_tokens = 7/10
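The trigram example on this slide can be reproduced with a few lines of Python. The sketch assumes that trigrams are built per word (no trigram spans the space), which matches the token sets listed here; the function names are illustrative.

```python
def trigrams(text):
    """Return the set of character trigrams of each word in text."""
    tokens = set()
    for word in text.lower().split():
        for i in range(len(word) - 2):
            tokens.add(word[i:i + 3])
    return tokens


def jaccard(a, b):
    """Jaccard similarity of two token sets: |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


if __name__ == "__main__":
    left = trigrams("tommas schmidt")
    right = trigrams("tomas schmidt")
    # 7 shared trigrams out of 10 distinct ones
    print(len(left & right), len(left | right), jaccard(left, right))
```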
Entity Resolution With Bit Arrays
Figure: a hash function h maps each trigram to a position in a bit array of length 20 (positions 0 to 19).
tommas schmidt: {tom, omm, mma, mas, sch, chm, hmi, mid, idt} -> 0 1 0 0 1 1 0 1 0 0 0 0 1 1 0 1 0 1 0 0
tomas schmidt: {tom, oma, mas, sch, chm, hmi, mid, idt} -> 0 1 0 0 1 1 0 1 0 0 1 0 1 1 0 0 0 1 0 0
Jaccard similarity = AND / OR = 7/9
Problems:
1. How to get similar/same quality as string comparison? Choose the length of the bit array to avoid collisions (303?), or increase the number of hash functions.
2. Does this method improve the runtime?
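A minimal sketch of the bit-array variant follows. Each trigram is hashed to one or more positions of a fixed-length bit array, and the similarity is the number of bits set in both arrays divided by the number of bits set in either. Using SHA-256 with a seed prefix as the hash family, a Python integer as the bit array, and the specific lengths are assumptions made for illustration; the slide does not specify the hash function.

```python
import hashlib


def trigrams(text):
    """Return the set of character trigrams of each word in text."""
    return {
        word[i:i + 3]
        for word in text.lower().split()
        for i in range(len(word) - 2)
    }


def bit_array(tokens, length, num_hashes=1):
    """Encode tokens as an integer whose bits form the bit array."""
    bits = 0
    for token in tokens:
        for seed in range(num_hashes):
            # Assumed hash family: SHA-256 over "seed:token"
            digest = hashlib.sha256(f"{seed}:{token}".encode("utf-8")).digest()
            position = int.from_bytes(digest[:8], "big") % length
            bits |= 1 << position
    return bits


def bit_jaccard(a, b):
    """Jaccard similarity on bit arrays: popcount(AND) / popcount(OR)."""
    union = bin(a | b).count("1")
    return bin(a & b).count("1") / union if union else 1.0


if __name__ == "__main__":
    for length in (20, 303, 4096):
        left = bit_array(trigrams("tommas schmidt"), length)
        right = bit_array(trigrams("tomas schmidt"), length)
        print(length, bit_jaccard(left, right))
```

Short arrays cause collisions, so the result can deviate from the exact token-based value of 7/10, as in the 7/9 example above. Comparing the output for several lengths against the exact value is one way to explore problem 1; timing both variants on many record pairs addresses problem 2.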
Parameter Tuning for Entity-Resolution Problems
Victor Christen
• Quality depends on the similarity function sim_f
• Determined by compared attributes and weights for each attribute combination
Figure: two example vectors over the attributes (Name, Description, Price), each fed into sim_f: (0.7, 0.4, 0.9) and (0.6, 0.4, 0.5)
Classification problem
• Entity Resolution as classification problem
Task
• Determine a logistic regression classifier based on given similarity vectors and a training data set
• Evaluate different training data set sizes by determining quality, variance, …
Advanced
• Investigate the impact of different classifiers on the given similarity vectors
Technology
• SparkML
• Logistic Regression, K-Means
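To show what "entity resolution as a classification problem" means, the sketch below trains a tiny logistic regression on similarity vectors (name, description, price) with plain stochastic gradient descent. It is pure Python, not SparkML, and the training vectors, labels, learning rate, and epoch count are all made up for illustration. In the actual task the SparkML implementation would take the place of `train`.

```python
from math import exp


def sigmoid(z):
    """Logistic function mapping a score to a probability."""
    return 1.0 / (1.0 + exp(-z))


def predict(weights, bias, vector):
    """Probability that a record pair with this similarity vector matches."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, vector)))


def train(vectors, labels, epochs=2000, learning_rate=0.5):
    """Fit weights and bias with stochastic gradient descent on log loss."""
    weights = [0.0] * len(vectors[0])
    bias = 0.0
    for _ in range(epochs):
        for vector, label in zip(vectors, labels):
            error = predict(weights, bias, vector) - label
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, vector)]
            bias -= learning_rate * error
    return weights, bias


# Made-up training data: similarity vectors (name, description, price)
VECTORS = [
    (0.9, 0.8, 0.9), (0.8, 0.7, 1.0), (0.7, 0.9, 0.8),  # matches
    (0.2, 0.1, 0.3), (0.3, 0.2, 0.1), (0.1, 0.3, 0.2),  # non-matches
]
LABELS = [1, 1, 1, 0, 0, 0]

if __name__ == "__main__":
    weights, bias = train(VECTORS, LABELS)
    # Score the two example vectors from this slide
    for vector in [(0.7, 0.4, 0.9), (0.6, 0.4, 0.5)]:
        print(vector, round(predict(weights, bias, vector), 3))
```

The learned weights play the role of the per-attribute weights in sim_f, so they no longer have to be tuned by hand. Repeating the training with differently sized samples is one way to approach the evaluation task.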
Graph-based Similarities for medical concepts
• Relatedness determined by using a knowledge base such as an ontology
• Ontologies represent the backbone of the Semantic Web
• Structure knowledge by defining concepts and relations between concepts, such as “Heart infarction”, “diabetes mellitus”,…
• Hierarchical structure of concepts
Related?
Concept Similarity
• Similarities based on basic measures
Figure: example concept hierarchy with the nodes A to L; the measures for concept C are counted step by step.
Measure      Concept(C)
#subsumer    2
#leaves      3, 4, 5 (values shown across the slide's build steps)
Local depth-first-search-based implementation needs from more than a day up to weeks!
Figure: isa hierarchy from general to specialized concepts, e.g. Heart infarction isa disease
Task & Requirements
Data
• Extracted directed acyclic graph (DAG) from the Unified Medical Language System
• 2.2 million concepts, 2.9 million relations
Task
• Parallel traversal algorithm to determine the measures
Technology
• Apache Flink/Gelly
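The two measures can be computed without a depth-first search per concept by propagating sets along a topological order: ancestors flow from the roots downward, leaves flow from the leaves upward. The sequential sketch below illustrates that idea on a toy DAG. It assumes that #subsumer counts proper ancestors and that a leaf counts itself as its only leaf; the function name and the toy edges are hypothetical. A Flink/Gelly solution would distribute the same propagation level by level, and at UMLS scale the per-concept sets would need a more compact representation.

```python
from collections import defaultdict, deque


def concept_measures(isa_edges):
    """Return {concept: (#subsumers, #leaves)} for (child, parent) edges."""
    parents = defaultdict(set)
    children = defaultdict(set)
    nodes = set()
    for child, parent in isa_edges:
        parents[child].add(parent)
        children[parent].add(child)
        nodes.update((child, parent))

    # Topological order from the roots downward (Kahn's algorithm).
    remaining = {node: len(parents[node]) for node in nodes}
    queue = deque(node for node in nodes if remaining[node] == 0)
    order = []
    ancestors = {}
    while queue:
        node = queue.popleft()
        order.append(node)
        # All parents are finished, so their ancestor sets are complete.
        collected = set()
        for parent in parents[node]:
            collected |= ancestors[parent] | {parent}
        ancestors[node] = collected
        for child in children[node]:
            remaining[child] -= 1
            if remaining[child] == 0:
                queue.append(child)

    # Reverse order: every child is processed before its parents.
    leaves = {}
    for node in reversed(order):
        if not children[node]:
            leaves[node] = {node}
        else:
            leaves[node] = set().union(*(leaves[c] for c in children[node]))

    return {n: (len(ancestors[n]), len(leaves[n])) for n in nodes}


# Toy DAG (hypothetical): D has two parents, so shared leaves count once
ISA = [("B", "A"), ("C", "A"), ("D", "B"), ("D", "C"), ("E", "C")]

if __name__ == "__main__":
    for concept, measures in sorted(concept_measures(ISA).items()):
        print(concept, measures)
```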
Topic Overview
Topic                                                       Framework                       #Students  Supervisor
Is the globe really warming?                                SparkR                          2          Lin
Analysing Metabolic Networks in Gradoop                     Gradoop/Flink                   2          Groß
Analyzing Panama Papers with Gradoop                        Gradoop/Flink                   2          Peukert
Analytics of Development Project Data                       Gradoop/Flink                   2          Peukert
Analyse LOD datasets within Gradoop                         Gradoop/Flink                   2          Nentwig
Analytics of Publication Data with Graphulo                 Apache Accumulo                 2          Kricke
Classification of program traces using TensorFlow or Caffe  TensorFlow or Caffe/Python/C++  2          Grimmer
Speed up Entity Resolution with Bit Arrays                  -                               2          Sehili
Graph-based Similarities for medical concepts               Flink                           2          Christen
Parameter Tuning for Entity-Resolution Problems             SparkML                         2          Christen