Apache Big_Data Europe event: "Integrators at work! Real-life applications of Apache Big Data...


Big Data Europe : Empowering Communities with Data Technologies

Dr Hajira Jabeen, University of Bonn

Apache Big Data Europe - Seville

Structure
◎ Evolution of BDE Architecture
◎ BDE Stack
◎ Beyond the State of the Art
  o Workflows (Support Layer)
  o User Interface
  o Data Lake (Ontario)
  o Semantic Analytics Stack

Architectural design 1

Architectural design 2

Architectural design 3 (released)

Architecture for SC 7

Supported Components
◎ Search/indexing
  o Apache Solr
◎ Data acquisition
  o Apache Flume
◎ Message passing
  o Apache Kafka
◎ Data storage
  o Hue
  o Apache Cassandra
  o ScyllaDB
  o Apache Hive
  o PostGIS
◎ Data processing
  o Apache Spark
  o Apache Flink
◎ Semantic Components
  o Strabon
  o Sextant
  o GeoTriples
  o Silk
  o SEMAGROW
  o LIMES
  o 4Store
  o OpenLink Virtuoso

User profiles

Platform installation
◎ Manual installation guide
◎ Using Docker Machine
  o On local machine (VirtualBox)
  o In cloud (AWS, DigitalOcean, Azure)
  o Bare metal
◎ Screencasts

Developing a component
◎ Base Docker images
  o Serve as a template for a (Big Data) technology
  o Easily extendable with a custom algorithm/data
◎ Published components
  o Responsibilities divided between partners
  o Image repositories on GitHub
  o Automated builds on Docker Hub
  o Documentation on BDE Wiki

Deploying a Big Data Stack
◎ Stack: a collection of communicating components that solves a specific problem
◎ Described in Docker Compose
  o Component configuration
  o Application topology
◎ Orchestrator required for the initialization process
  o Components may depend on each other
  o Components may require manual intervention
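The initialization problem on this slide (components that depend on each other, some needing manual intervention) can be illustrated with a dependency-ordered startup plan. This is only a minimal sketch, not the BDE orchestrator itself: the component names, the `stack` mapping and the `needs_manual` set are hypothetical, and the ordering uses Python's standard `graphlib` module.

```python
from graphlib import TopologicalSorter

# Hypothetical stack: each component maps to the components it depends on.
stack = {
    "zookeeper": set(),
    "kafka": {"zookeeper"},
    "spark-master": set(),
    "spark-worker": {"spark-master"},
    "pipeline-app": {"kafka", "spark-worker"},
}

# Hypothetical: components that need an operator to confirm before they start.
needs_manual = {"pipeline-app"}


def startup_plan(dependencies, manual):
    """Return (component, requires_manual_step) pairs in a valid start order.

    Every component appears after all of the components it depends on.
    """
    order = TopologicalSorter(dependencies).static_order()
    return [(name, name in manual) for name in order]


plan = startup_plan(stack, needs_manual)
for name, manual_step in plan:
    note = " (wait for manual confirmation)" if manual_step else ""
    print(f"start {name}{note}")
```

A cyclic dependency would make `static_order()` raise `graphlib.CycleError`, which is a reasonable point at which to reject a stack description.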

User Interfaces
◎ Target: Facilitate use of the platform
  o User Interface Adaptation
◎ Available interfaces
  o Workflow UIs
    ❖ Workflow Builder
    ❖ Workflow Monitor
  o Swarm UI
  o Integrator UI

BDE Workflow Builder

BDE Workflow Monitor

Swarm UI

Integrator UI

Beyond the State of the Art

Smart Big Data

Increase Big Data value by adding meaning to it!

Orchestration
◎ Goal: flexible composition of Big Data pipelines
◎ Microservice cooperation using a shared vocabulary
◎ Uses HTTP requests from a web-based frontend
◎ Meta information stored in a triple store
◎ Microservices use the triple store as service backend

Semantic Data Lake (Ontario)

◎ Repository of data in its raw format
  o Structured, semi-structured, unstructured
◎ Schema-less
  o No schema is defined on write; it is defined only on read
◎ Open to any kind of processing
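The schema-on-read idea can be shown in a few lines. This is a toy sketch under stated assumptions, not how Ontario stores data: `raw_store`, `write_raw` and `read_with_schema` are hypothetical names, and the raw records are JSON lines mixed with free text.

```python
import json

# Raw records are kept exactly as they arrive: nothing is validated on write.
raw_store = []


def write_raw(line):
    """Append a raw record to the lake without checking any schema."""
    raw_store.append(line)


def read_with_schema(schema):
    """Apply a schema at read time.

    `schema` maps a field name to a cast function. Records that are not JSON
    objects stay in the lake but are skipped by this particular reader.
    """
    rows = []
    for line in raw_store:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(record, dict):
            continue
        row = {}
        for field, cast in schema.items():
            value = record.get(field)
            row[field] = cast(value) if value is not None else None
        rows.append(row)
    return rows


write_raw('{"sensor": "s1", "temp": "21.5", "unit": "C"}')
write_raw('{"sensor": "s2", "temp": 19}')
write_raw("free text note: sensor s3 offline")

readings = read_with_schema({"sensor": str, "temp": float})
print(readings)
```

A different reader could apply another schema to the same raw records, which is what leaves the lake open to any kind of processing.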

Data Lake

Source: Big Data @ Microsoft - Raghu Ramakrishnan

Semantic Data Lake (Ontario)

◎ Add a semantic layer on top of the source datasets
  o Semantic data is handled as-is
  o Non-semantic data is semantically lifted using existing ontology terms

Ontario: Architecture

Translate and execute Query via Source-specific Access Method

Decompose to Source-specific Entities

Decompose SPARQL Query
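The three labels of the architecture figure can be read as a pipeline: decompose the SPARQL query, map the parts to source-specific entities, then translate and execute per source. The sketch below covers only the first two steps and rests on assumptions that the slides do not state: triple patterns are grouped by subject, and a hypothetical catalogue `SOURCE_OF_PREDICATE` records which source holds which predicate. All identifiers and source names are made up for illustration.

```python
from collections import defaultdict

# Hypothetical catalogue: which source in the lake holds which predicate.
SOURCE_OF_PREDICATE = {
    "ex:name": "csv:persons",
    "ex:worksFor": "csv:persons",
    "ex:locatedIn": "rdf:organisations",
}


def decompose(patterns):
    """Group triple patterns by subject, giving one sub-query per subject."""
    groups = defaultdict(list)
    for subject, predicate, obj in patterns:
        groups[subject].append((subject, predicate, obj))
    return dict(groups)


def assign_sources(groups):
    """Map every sub-query to the sources that hold its predicates."""
    plan = {}
    for subject, patterns in groups.items():
        plan[subject] = sorted({SOURCE_OF_PREDICATE[p] for _, p, _ in patterns})
    return plan


# Basic graph pattern of a SPARQL query, written as (subject, predicate, object).
query = [
    ("?person", "ex:name", "?n"),
    ("?person", "ex:worksFor", "?org"),
    ("?org", "ex:locatedIn", "?city"),
]

plan = assign_sources(decompose(query))
print(plan)
```

The last step, translating each sub-query into the access method of its source (for example SQL for a relational source), is left out here because it depends on the wrappers available for each source.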

Semantic Analytics Stack (SANSA)

SANSA: Motivation
◎ Abundant machine-readable structured information is available (e.g. in RDF)
  o Across SCs, e.g. Life Science Data (OpenPhacts)
  o General: DBpedia, Google Knowledge Graph
  o Social graphs: Facebook, Twitter
◎ Need for scalable querying, inference and machine learning
  o Link prediction
  o Knowledge base completion
  o Predictive analytics

SANSA Stack

SANSA: Read Write Layer
◎ Ingest RDF and OWL data in different formats using Jena / OWL API style interfaces
◎ Represent data in multiple formats (e.g. RDD, DataFrames, GraphX, Tensors)
◎ Allow transformation among these formats
◎ Compute dataset statistics and apply functions to URIs, literals, subjects, objects → Distributed LODStats
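To make "dataset statistics" concrete, here is a single-machine sketch of the kind of counts a LODStats-style job produces over triples. It is not the SANSA API and it is not distributed. The sample triples and the `dataset_stats` function are hypothetical, and on Spark each count would become an aggregation over a distributed collection instead.

```python
from collections import Counter

# Hypothetical sample data: (subject, predicate, object) triples.
triples = [
    ("ex:alice", "rdf:type", "ex:Person"),
    ("ex:alice", "ex:name", '"Alice"'),
    ("ex:bob", "rdf:type", "ex:Person"),
    ("ex:bob", "ex:knows", "ex:alice"),
]


def dataset_stats(data):
    """Compute a few basic statistics over a list of triples."""
    return {
        "triples": len(data),
        "distinct_subjects": len({s for s, _, _ in data}),
        "distinct_objects": len({o for _, _, o in data}),
        # How often each property is used.
        "property_usage": Counter(p for _, p, _ in data),
        # How many instances each class has (objects of rdf:type).
        "class_usage": Counter(o for _, p, o in data if p == "rdf:type"),
    }


stats = dataset_stats(triples)
print(stats)
```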

SANSA: Query Layer
◎ Goal: make generic queries efficient and fast using:
  o Intelligent indexing
  o Splitting strategies
  o Distributed storage
  o Distributed / federated querying
◎ Early work in progress: query evaluation (SPARQL-to-SQL approaches, Virtual Views)
◎ Provision of a W3C SPARQL-compliant endpoint

SANSA: Inference Layer
◎ W3C standards for modelling: RDFS and OWL
◎ Parallel in-memory inference via rule-based forward chaining
◎ Beyond state of the art: dynamically build a rule dependency graph for a rule set
  → Adjustable performance levels
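Rule-based forward chaining means applying rules to the known triples until no new triple can be derived. The following naive, single-machine sketch applies just two RDFS rules (rdfs11, transitivity of `rdfs:subClassOf`, and rdfs9, type propagation along `rdfs:subClassOf`). It does not build a rule dependency graph and it is not the SANSA implementation. The function name and the sample data are hypothetical.

```python
def forward_chain(triples):
    """Return the closure of `triples` under RDFS rules rdfs9 and rdfs11."""
    facts = set(triples)
    while True:
        sub_class = {(s, o) for s, p, o in facts if p == "rdfs:subClassOf"}
        derived = set()
        # rdfs11: (A subClassOf B) and (B subClassOf C) => (A subClassOf C)
        for a, b in sub_class:
            for c, d in sub_class:
                if b == c:
                    derived.add((a, "rdfs:subClassOf", d))
        # rdfs9: (x type A) and (A subClassOf B) => (x type B)
        for s, p, o in facts:
            if p == "rdf:type":
                for c, d in sub_class:
                    if o == c:
                        derived.add((s, "rdf:type", d))
        if derived <= facts:
            return facts  # fixpoint reached: nothing new was derived
        facts |= derived


data = [
    ("ex:Student", "rdfs:subClassOf", "ex:Person"),
    ("ex:Person", "rdfs:subClassOf", "ex:Agent"),
    ("ex:alice", "rdf:type", "ex:Student"),
]

closure = forward_chain(data)
print(sorted(closure - set(data)))
```

A rule dependency graph would record which rules can produce input for which other rules, so that an engine can decide which rules to schedule together or re-run. Here every rule is simply re-applied in every round.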

SANSA: ML Layer
◎ Distributed Machine Learning (ML) algorithms that work on RDF data and make use of its structure / semantics
◎ Work in progress:
  o Tensor factorization for e.g. KB completion (testing stage)
  o Simple spatiotemporal analytics (idea stage)
  o Graph clustering (testing stage)
  o Association rule mining (evaluation stage)
  o Semantic decision trees (idea stage)

Thank you

jabeen@iai.uni-bonn.de

BDE vs Hadoop distributions

| | Hortonworks | Cloudera | MapR | Bigtop | BDE |
|---|---|---|---|---|---|
| File system | HDFS | HDFS | NFS | HDFS | HDFS |
| Installation | Native | Native | Native | Native | Lightweight virtualization |
| Plug & play components (no rigid schema) | No | No | No | No | Yes |
| High availability | Single failure recovery (YARN) | Single failure recovery (YARN) | Self-healing, multiple failure recovery | Single failure recovery (YARN) | Multiple failure recovery |
| Cost | Commercial | Commercial | Commercial | Free | Free |
| Scaling | Freemium | Freemium | Freemium | Free | Free |
| Addition of custom components | Not easy | No | No | No | Yes |
| Integration testing | Yes | Yes | Yes | Yes | -- |
| Operating systems | Linux | Linux | Linux | Linux | All |
| Management tool | Ambari | Cloudera Manager | MapR Control System | - | Docker Swarm UI + custom |

BDE vs Hadoop distributions

◎ BDE is not built on top of existing distributions
◎ Targets
  o Communities
  o Research institutions
◎ Bridges scientists and open data
◎ Multi-tier research efforts towards Smart Data
