ECSA 2013 (Cuesta)

TOWARDS AN ARCHITECTURE FOR MANAGING BIG SEMANTIC DATA IN REAL-TIME

Carlos E. Cuesta, VorTIC3, URJC, Spain

Miguel A. Martínez-Prieto, UVa, Spain

Javier D. Fernández, UVa, Spain & UChile, Chile

Montpellier, France, 02/07/2013

CONTENTS

Introduction

Problem Statement

Context: the RDF world

Proposal: SOLID Architecture

Unfolding in five Layers

SOLID in Practice

The RDF/HDT format

The SOLID/HDT Architecture

Conclusions & Future work


INTRODUCTION

Big Data has become an important topic

When the size of the data itself becomes part of the problem (Loukides)

Characterized by the “three Vs”

Volume: large amounts of data gathered and stored

The challenge is storage, but also computing

Volume is relative: depends on available resources

Velocity: different flows of data at different rates

Variety: the kind of structures within the data

Each source has its own semantics

Need for a logical model to allow data integration

An architecture for Big Data must consider all of these


INTRODUCTION

One of the dimensions always becomes critical

E.g. storage in mobile applications, velocity in real-time applications (vs. batch processes)

We promote variety

The dataset value is increased when multiple sources are integrated, achieving more knowledge

This also influences velocity and volume

We choose a graph-based model

Allows managing higher levels of variety

Data can be linked and queried together

In practice, this means using RDF as data model

The cornerstone of the “practical” Semantic Web

The basis of the emergent Web of Data

PROBLEM STATEMENT

Most solutions for managing Big Data aim to maximize the volume dimension

Therefore promoting efficient storage

Datastores able to cope with large datasets

Indexing strategies to achieve high space

Datastores must be assumed to be stable

In spite of the assumed immutability property

But, the volume of incoming data is also big

Datastores must be periodically updated & reindexed

This is very complex in a Real-Time context

Data must be received and integrated in real time

No time to process the flow of incoming data

OUR PROPOSAL: SOLID ARCHITECTURE

We propose a specific architecture to manage Real-Time flows in this context

A multi-tiered architecture

Separates consumption of Big Semantic Data…

… from the complexities of Real-Time operation

Data must be kept compact

It is stored and indexed in a compressed way

Data & Index Layers

Needs to efficiently cope with data updates

The reason for the Online Layer

Needs to query all of this together

The reason for the Service Layer

CONTEXT: RDF

RDF: Resource Description Framework

Data described as (subject, predicate, object) triples

An RDF dataset is a graph of knowledge

Entities linked to values via labelled edges

Essential for Linked Open Data

Adopted in many different contexts

Simple integration: everything has a URI

[Figure: example RDF graph in which the node "John" is linked to the node "Car" by an edge labelled "owns"]
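The triple model above can be illustrated with a minimal sketch. It assumes plain Python tuples stand in for RDF triples; the `ex:` identifiers, the extra `name` and `color` triples, and the helper names `build_graph` and `objects` are hypothetical and only serve the illustration.

```python
from collections import defaultdict

# Hypothetical example data: identifiers and predicates are illustrative only.
triples = [
    ("ex:John", "ex:owns", "ex:Car"),
    ("ex:John", "ex:name", "John"),
    ("ex:Car", "ex:color", "red"),
]

def build_graph(triples):
    """Group triples by subject: subject -> list of (predicate, object) edges."""
    graph = defaultdict(list)
    for s, p, o in triples:
        graph[s].append((p, o))
    return dict(graph)

def objects(graph, subject, predicate):
    """Return every object linked to `subject` through `predicate`."""
    return [o for p, o in graph.get(subject, []) if p == predicate]

graph = build_graph(triples)
print(objects(graph, "ex:John", "ex:owns"))
```

Each subject becomes a node and each (predicate, object) pair a labelled outgoing edge, which is all the "graph of knowledge" view requires.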

CONTEXT: RDF

The origin of the Web of Data

Two datasets can become connected by a single triple

<“Station #123”, location, Canal Street>

The web becomes data-centric

Every unit is a small piece of data

“The Big Data’s long tail”

But their integration in large contexts becomes complex: Big Semantic Data

A variety of sources become easily integrated

RDF is not a serialization format

Describes what data is stored, not how this is done
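A small sketch of how a single triple can connect two datasets, assuming sets of tuples as a stand-in for RDF graphs. The linking triple is modelled on the slide's station example; every other identifier (`ex:SomeCity`, `ex:type`, `ex:inCity`) and the helper `reachable` are hypothetical.

```python
# Two independently published datasets (all identifiers are hypothetical).
stations = {("ex:Station123", "ex:type", "ex:Station")}
streets = {("ex:CanalStreet", "ex:inCity", "ex:SomeCity")}

# A single linking triple, modelled on the slide's example.
link = ("ex:Station123", "ex:location", "ex:CanalStreet")

unlinked = stations | streets
merged = stations | streets | {link}

def reachable(triples, start):
    """Return all nodes reachable from `start` by following edges forward."""
    seen, frontier = set(), [start]
    while frontier:
        node = frontier.pop()
        if node in seen:
            continue
        seen.add(node)
        frontier.extend(o for s, _, o in triples if s == node)
    return seen

print("ex:SomeCity" in reachable(unlinked, "ex:Station123"))
print("ex:SomeCity" in reachable(merged, "ex:Station123"))
```

Without the link, the station cannot be related to the city; once the link is added, a traversal starting at the station crosses into the second dataset.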

SOLID ARCHITECTURE

[Figure: the SOLID architecture, with its Online, Data, Index, Service and Merge (batch) layers. Diagram labels: New Data, Dump, Rd, DataStore, Big Data, Query, Join, Parallelizable Processing]

SOLID ARCHITECTURE

[Figure: the SOLID architecture, with its Online, Data, Index, Service and Merge (batch) layers, annotated with RDF and SPARQL. Diagram labels: New Data, Dump, Rd, DataStore, Big Data, Query, Join, Parallelizable Processing]

SOLID ARCHITECTURE

Online Layer

Receives incoming new data

Deals with real-time needs

Data Layer

The core of the architecture

The main datastore: the Big Data repository

Stores compressed RDF

Index Layer

Provides an index for the Data Layer, enabling high-speed access

Most accesses to the repository are made through it
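The relation between the Data Layer and the Index Layer can be sketched as a read-only store plus a lookup structure built on top of it. This is only an illustration under simplifying assumptions: the class names `DataStore` and `SubjectIndex` are hypothetical, the store is an in-memory sorted tuple rather than compressed RDF, and only lookups by subject are indexed.

```python
from bisect import bisect_left, bisect_right

class DataStore:
    """Read-only store: triples are sorted once and never modified."""

    def __init__(self, triples):
        self._triples = tuple(sorted(set(triples)))

    def __len__(self):
        return len(self._triples)

    def __getitem__(self, position):
        return self._triples[position]

    def __iter__(self):
        return iter(self._triples)

class SubjectIndex:
    """Hypothetical index: finds the range of a subject in the sorted store."""

    def __init__(self, store):
        self._store = store
        # Subjects are already in order because the store is sorted.
        self._subjects = [triple[0] for triple in store]

    def lookup(self, subject):
        lo = bisect_left(self._subjects, subject)
        hi = bisect_right(self._subjects, subject)
        return [self._store[i] for i in range(lo, hi)]

store = DataStore([("b", "p", "x"), ("a", "q", "y"), ("a", "p", "x")])
index = SubjectIndex(store)
print(index.lookup("a"))
```

Because the store never changes after construction, the index can be built once and stays valid until the store is replaced as a whole.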

SOLID ARCHITECTURE

Service Layer

The façade to the external user

Issues federated SPARQL queries to the separate datastores in the different layers

Every query is distributed, and the different answers are joined

Merge Layer

Makes it possible to integrate the two datastores

Receives a data dump from the Online Layer

Integrates that with the existing repository

Producing a fresh copy of the Data Layer

Immutability properties are preserved
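The interplay of the Service and Merge layers described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: a single triple pattern with `None` as wildcard stands in for a federated SPARQL query, and the names `OnlineStore`, `FrozenStore`, `query` and `merge` are hypothetical.

```python
def _fits(triple, s, p, o):
    """True if the triple matches the pattern; None acts as a wildcard."""
    return all(want is None or want == got for want, got in zip((s, p, o), triple))

class OnlineStore:
    """Mutable store for freshly arrived triples (Online Layer stand-in)."""

    def __init__(self):
        self._triples = set()

    def insert(self, triple):
        self._triples.add(triple)

    def match(self, s=None, p=None, o=None):
        return [t for t in self._triples if _fits(t, s, p, o)]

    def dump(self):
        """Hand over all triples and start again from empty."""
        out = frozenset(self._triples)
        self._triples.clear()
        return out

class FrozenStore:
    """Immutable store (Data Layer stand-in)."""

    def __init__(self, triples=()):
        self._triples = frozenset(triples)

    def match(self, s=None, p=None, o=None):
        return [t for t in self._triples if _fits(t, s, p, o)]

    def triples(self):
        return self._triples

def query(online, frozen, s=None, p=None, o=None):
    """Service Layer: send the same pattern to both stores and join the answers."""
    return sorted(set(online.match(s, p, o)) | set(frozen.match(s, p, o)))

def merge(online, frozen):
    """Merge Layer: build a fresh store; the previous one is left untouched."""
    return FrozenStore(frozen.triples() | online.dump())

online = OnlineStore()
frozen = FrozenStore([("a", "p", "b")])
online.insert(("c", "p", "d"))
before = query(online, frozen, p="p")
fresh = merge(online, frozen)
```

The merge never modifies the existing store: it produces a new one that can replace the old copy, which is how the immutability property survives updates in this sketch.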

SOLID IN PRACTICE

This abstract architecture becomes feasible by applying existing technology

In particular, the RDF/HDT binary format

Decisions must be taken, layer by layer, about how to actually implement it

Other alternatives would also be possible (and some of them are also being implemented)

Data-Centric Layers

Do not use a textual RDF representation

Inefficient, prevents some potential uses

RDF/HDT is a binary format

Conceived specifically for serialization purposes

SOLID IN PRACTICE

RDF/HDT format

Designed for machine processing

About 15 times less space than equivalent formats

Uses compact (compressed) data structures

Data Layer

Big Semantic Data in RDF/HDT

Data saving and guaranteed immutability

Instant mapping to memory

Allows querying without decompressing

Index Layer

Implements the HDT/FoQ proposal

Lightweight index on top of the HDT binary format

Efficient SPARQL retrieval without decompressing
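One general idea behind compact RDF representations is dictionary encoding: each distinct term is stored once and triples refer to it by integer ID. The sketch below illustrates only that idea. It does not reproduce the actual RDF/HDT binary layout, which these slides do not detail, and the function names `encode` and `decode` are hypothetical.

```python
def encode(triples):
    """Replace every term by an integer ID; return (terms, id_triples)."""
    terms = sorted({term for triple in triples for term in triple})
    ids = {term: i for i, term in enumerate(terms, start=1)}
    id_triples = sorted((ids[s], ids[p], ids[o]) for s, p, o in triples)
    return terms, id_triples

def decode(terms, id_triples):
    """Map integer IDs back to the original terms (IDs start at 1)."""
    return [tuple(terms[i - 1] for i in triple) for triple in id_triples]

# Hypothetical example data.
data = [
    ("ex:John", "ex:owns", "ex:Car"),
    ("ex:Car", "ex:color", "red"),
]
terms, id_triples = encode(data)
print(id_triples)
print(decode(terms, id_triples))
```

Long, repeated strings are stored once in `terms`, while the triples shrink to small integers; triple patterns can then be matched on the IDs without rebuilding the original text.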

SOLID IN PRACTICE

Online Layer

Copes with the incoming flow of real-time data

HDT is inadequate (designed for read-only)

Must resolve SPARQL efficiently

Choose a general-purpose NoSQL technology

Still able to dump data in an RDF format

Service Layer

Resolves any potential queries

SPARQL considered expressive enough

Queries are forwarded to Online and Index Layers

Their results are retrieved and combined

Using a (scalable) Pipe-Filter approach
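A Pipe-Filter combination of results from two layers can be sketched with Python generators, where each generator is a filter and the chained calls form the pipe. This is an assumption-laden illustration: the stores are plain lists, a triple pattern with `None` wildcards replaces SPARQL, and the names `source`, `union`, `distinct` and `answer` are hypothetical.

```python
def source(store, pattern):
    """Filter: yield the triples of one store that match the pattern."""
    for triple in store:
        if all(want is None or want == got for want, got in zip(pattern, triple)):
            yield triple

def union(*streams):
    """Filter: pass through the items of several streams, one after another."""
    for stream in streams:
        yield from stream

def distinct(stream):
    """Filter: drop items that were already seen earlier in the stream."""
    seen = set()
    for item in stream:
        if item not in seen:
            seen.add(item)
            yield item

def answer(online, indexed, pattern):
    """Pipe: forward the pattern to both stores and combine their results."""
    return distinct(union(source(online, pattern), source(indexed, pattern)))

online = [("a", "p", "b"), ("c", "q", "d")]
indexed = [("a", "p", "b"), ("e", "p", "f")]
results = list(answer(online, indexed, (None, "p", None)))
print(results)
```

Results flow through the pipeline one item at a time, so no stage needs to hold a complete answer set except for the duplicate check.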


SOLID IN PRACTICE

Merge Layer

Able to combine incoming data from the Online Layer with the existing datastore in the Data Layer

The data dump is merged into a copy of the datastore

Then the fresh datastore replaces the previous one

Periodic process, can also be manually triggered

Requires high-performance computation

In practice, this means a Map/Reduce approach

Raw RDF data from Online Layer is converted

Then ordered for internal merging

Depends on the size of the smaller store

Also triggers reindexing the Index Layer
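The "order, then merge" step can be sketched on a single machine with a sort followed by a sorted merge. This is a stand-in for the idea only, not a Map/Reduce or Hadoop job; the function name `merge_sorted` and the integer triples are hypothetical, and the existing datastore is assumed to be sorted already.

```python
import heapq

def merge_sorted(datastore, dump):
    """Merge a sorted datastore with an unsorted dump, dropping duplicates."""
    ordered_dump = sorted(dump)  # only the (smaller) dump needs sorting
    merged, last = [], None
    for triple in heapq.merge(datastore, ordered_dump):
        if triple != last:
            merged.append(triple)
            last = triple
    return merged

datastore = [(1, 1, 1), (2, 2, 2)]          # existing store, already sorted
dump = [(3, 1, 1), (1, 1, 1), (0, 5, 5)]    # raw data from the Online Layer
fresh = merge_sorted(datastore, dump)
print(fresh)
```

In this sketch the sorting cost depends only on the size of the dump, while the merge is a single linear pass over both inputs that yields a new list and leaves `datastore` unchanged.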

SOLID ARCHITECTURE IN PRACTICE

[Figure: the SOLID architecture with concrete technologies. Online Layer: NoSQL; Data Layer: RDF/HDT; Merge Layer (batch): Hadoop; Service Layer: SPARQL + P/F; Index Layer. Other diagram labels: New Data, Dump, Rd, SPARQL, Semantic Data]

CONCLUSIONS & FUTURE WORK

We propose SOLID as a generic architecture for managing Big Semantic Data

Our particular implementation relies on HDT

Also NoSQL for real-time incoming data

Cassandra, but (still) not the only choice

Map/Reduce (Hadoop) for intensive processing

Highly effective in terms of space & time

Initial empirical results are very significant

Currently developing an optimized prototype

Already working on variants of the architecture

Limited version for mobile devices

The Merge Layer is not directly required

THANKS FOR YOUR ATTENTION

