Processing paradigms in Big Data: current state, trends and opportunities


Dr. Rubén Casado

ruben.casado@treelogic.com

ruben_casado

Processing paradigms in Big Data: current state, trends and opportunities

Universidad Complutense de Madrid, 19 November 2014

1. Big Data processing

2. Batch processing

3. Streaming processing

4. Hybrid computation model

5. Open Issues & Conclusions

Agenda

Academics: PhD in Software Engineering, MSc in Computer Science, BSc in Computer Science

Work Experience

1. Big Data processing

2. Batch processing

3. Streaming processing

4. Hybrid computation model

5. Open Issues & Conclusions

Agenda

A massive volume of both structured and unstructured data that is so large it is difficult to process with traditional database and software techniques

What is Big Data?

Big Data are high-volume, high-velocity,

and/or high-variety information assets that

require new forms of processing to enable

enhanced decision making, insight

discovery and process optimization

How is Big Data?

- Gartner IT Glossary -

3 problems

Volume

Variety Velocity

3 solutions

Batch processing

NoSQL

Streaming processing

3 solutions

Batch processing

NoSQL

Streaming processing

Volume

Variety Velocity

Science or Engineering?

Science or Engineering?

Volume

Variety

Value

Velocity

Science or Engineering?

Volume

Variety

Value

Velocity

Software Engineering

Data Science


Relational databases: schema-based; ACID (Atomicity, Consistency, Isolation, Durability); performance penalty and scalability issues.

NoSQL: Not Only SQL; families of solutions (Google BigTable, Amazon Dynamo); BASE = Basically Available, Soft state, Eventually consistent; CAP = Consistency, Availability, Partition tolerance.

NoSQL


Key-value: Key = ID, Value = associated data (a dictionary). Examples: LinkedIn Voldemort, Riak, Redis, Memcache, Membase.
  CR7: 'Cristiano Ronaldo'

Document: more complex than key-value; documents are indexed by ID; multiple indexes. Examples: MongoDB, CouchDB.
  CR7: {Name: 'Cristiano', Surname: 'Ronaldo', Age: 29}

Column: tables with predefined families of fields; fields within families are flexible; vertical and horizontal partitioning. Examples: HBase, Cassandra.
  CR7: [Personal: {Name: 'Cristiano', Surname: 'Ronaldo', Age: 29}, Job: {Team: 'R. Madrid', Salary: 20,000,000}]

Graph: nodes and relationships. Examples: Neo4j, FlockDB, OrientDB.
  [CR] -is_called-> [Cristiano]
  [CR] -plays_for-> [R.Madrid]

NoSQL

• Scalable
• Large amount of static data
• Distributed
• Parallel
• Fault tolerant
• High latency

Batch processing

Volume

• Low latency
• Continuous unbounded streams of data

• Distributed

• Parallel

• Fault-tolerant

Streaming processing

Velocity

• Low latency: real-time
• Massive data-at-rest + data-in-motion
• Scalable
• Combine batch and streaming results

Hybrid computation model

Volume Velocity

All data → Batch processing → Batch results
New data → Streaming processing → Stream results
Batch results + Stream results → Combination → Final results

Hybrid computation model

Batch processing: large amounts of static data; scalable solution; addresses Volume.

Streaming processing: computing streaming data; low latency; addresses Velocity.

Hybrid computation: Lambda Architecture; addresses Volume + Velocity.
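To make the combination step concrete, here is a toy, framework-independent sketch in Java: the batch layer provides a partial aggregate (sum and count) precomputed over all historical data, the speed layer a partial aggregate over the data that arrived since the last batch run, and the serving layer merges both when queried. All class and variable names are illustrative and not taken from any specific framework.

// Hypothetical serving-layer merge for the Lambda Architecture:
// batch view (over all historical data) + stream view (data since the last batch run).
public class CombinedView {

    static class PartialAvg {
        final double sum;
        final long count;
        PartialAvg(double sum, long count) { this.sum = sum; this.count = count; }
    }

    // Final result = weighted merge of both partial aggregates
    static double mergedAverage(PartialAvg batch, PartialAvg stream) {
        long total = batch.count + stream.count;
        return total == 0 ? 0.0 : (batch.sum + stream.sum) / total;
    }

    public static void main(String[] args) {
        PartialAvg batchView  = new PartialAvg(20130.0, 10000); // from the batch layer
        PartialAvg streamView = new PartialAvg(45.0, 20);       // from the streaming layer
        System.out.println("SO2 avg = " + mergedAverage(batchView, streamView));
    }
}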

Processing Paradigms

• Inception: 2003
• 1st generation (Batch): 2006
• 2nd generation (Streaming): 2010
• 3rd generation (Hybrid): 2014

+10 years of Big Data processing technologies (2003-2014)

• 2003: "The Google File System"
• 2004: "MapReduce: Simplified Data Processing on Large Clusters"
• Doug Cutting starts developing Hadoop
• 2006: Yahoo! starts working on Hadoop
• Apache Hadoop is in production
• Facebook creates Hive
• Yahoo! creates Pig
• Cloudera presents Flume
• Yahoo! creates S4
• Nathan Marz creates Storm
• LinkedIn presents Kafka
• 2012: Nathan Marz defines the Lambda Architecture
• 2013: "MillWheel: Fault-Tolerant Stream Processing at Internet Scale"; LinkedIn presents Samza
• Spark stack is open sourced
• Lambdoop & Summingbird first steps
• 2014: Stratosphere becomes Apache Flink

Processing Pipeline

DATA ACQUISITION → DATA STORAGE → DATA ANALYSIS → RESULTS

Static stations and mobile sensors in Asturias sending streaming data

Historical data of > 10 years

Monitoring, trend identification, predictions

Air Quality case study

1. Big Data processing overview

2. Batch processing

3. Real-time processing

4. Hybrid computation model

5. Open Issues & Conclusions

Agenda

Batch processing technologies

DATA ACQUISITION → DATA STORAGE → DATA ANALYSIS → RESULTS

Data acquisition: HDFS commands, Sqoop, Flume, Scribe
Data storage: HDFS, HBase
Data analysis: MapReduce, Hive, Pig, Cascading, Spark, Spark SQL (Shark)

• Import to HDFS

hadoop dfs -copyFromLocal <path-to-local> <path-to-remote>

hadoop dfs -copyFromLocal /home/hduser/AirQuality/ /hdfs/AirQuality/

HDFS commands DATA ACQUISITION

BATCH

• Tool designed for transferring data between HDFS/HBase and structured datastores
• Based on MapReduce
• Includes connectors for multiple databases:
  o MySQL
  o PostgreSQL
  o Oracle
  o SQL Server
  o DB2
  o Generic JDBC connector
• Java API

Sqoop DATA ACQUISITION

BATCH

1) Import data from the database to HDFS

import -all-tables --connect jdbc:mysql://localhost/testDatabase --target-dir hdfs://rootHDFS/testDatabase --username user1 --password pass1 -m 1

2) Analyze data (Hadoop)

3) Export results to the database

export --connect jdbc:mysql://localhost/testDatabase --export-dir hdfs://rootHDFS/testDatabase --username user1 --password pass1 -m 1

Sqoop DATA ACQUISITION

BATCH

• Service for collecting, aggregating, and moving large amounts of log data
• Simple and flexible architecture based on streaming data flows
• Reliability, scalability, extensibility, manageability
• Supported log stream types:
  o Avro
  o Syslog
  o Netcat

Flume DATA ACQUISITION

BATCH

Sources: Avro, Thrift, Exec, JMS, NetCat, Syslog TCP/UDP, HTTP, Custom
Channels: Memory, JDBC, File
Sinks: HDFS, Logger, Avro, Thrift, IRC, File Roll, Null, HBase, Custom

• Architecture
  o Source: waits for incoming events.
  o Channel: stores the information until it is consumed by the sink.
  o Sink: sends the information towards another agent or system.

Flume DATA ACQUISITION

BATCH

Stations send the information to the servers. Flume collects this information and moves it into HDFS for further analysis. Air quality syslogs.

Flume DATA ACQUISITION

BATCH

Station; Title; latitude; longitude; Date; SO2; NO; CO; PM10; O3; dd; vv; TMP; HR; PRB;

"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "8"; "0.35"; "13"; "67"; "158"; "3.87"; "18.8"; "34"; "982";

"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "7"; "0.32"; "16"; "66"; "158"; "4.03"; "19"; "35"; "981"; "23";

"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "6"; "0.26"; "24"; "68"; "158"; "3.76"; "19.1"; "36"; "980"; "23";

"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "6"; "6"; "0.31"; "7"; "67"; "135"; "2.41"; "19.2"; "36"; "981"; "23";

"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "6"; "9"; "0.24"; "24"; "63"; "44"; "1.7"; "15.9"; "62"; "983"; "23";

• Server for aggregating log data streamed in real time from a large number of servers

• There is a scribe server running on every node in the system, configured to aggregate messages and send them to a central scribe server (or servers) in larger groups.

• The central scribe server(s) can write the messages to the files that are their final destination

Scribe DATA ACQUISITION

BATCH

category = 'mobile'

# e.g. '1; 43.5298; -5.6734; 2000-01-01; 23; 89; 1.97; …'
message = sensor_log.readLine()

log_entry = scribe.LogEntry(category, message)

# Create a Scribe client (the Thrift transport and protocol are assumed to be set up beforehand)
client = scribe.Client(iprot=protocol, oprot=protocol)

transport.open()
result = client.Log(messages=[log_entry])
transport.close()

• Sending a sensor message to a Scribe Server

Scribe DATA ACQUISITION

BATCH

• Distributed file system for Hadoop
• Master-slave architecture (NameNode - DataNodes)
  o NameNode: manages the directory tree and regulates access to files by clients
  o DataNodes: store the data
• Files are split into blocks of the same size, and these blocks are stored and replicated in a set of DataNodes

HDFS DATA STORAGE

BATCH
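Besides the shell commands shown earlier, the same import can be done programmatically. A minimal sketch with the Hadoop FileSystem Java API, reusing the illustrative paths from the acquisition slide:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CopyToHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        // Copy the local air-quality folder into HDFS; blocks are replicated across DataNodes
        fs.copyFromLocalFile(new Path("/home/hduser/AirQuality/"),
                             new Path("/hdfs/AirQuality/"));
        fs.close();
    }
}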

• Open-source non-relational distributed column-oriented database modeled after Google’s BigTable.

• Random, realtime read/write access to the data.

• Not a relational database.

o Very light «schema»

• Rows are stored in sorted order.

DATA STORAGE

BATCH

HBase
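A minimal sketch of a write and a random read with the HBase Java client of that era (0.9x API); the 'air_quality' table, the 'measures' column family and the row-key layout are illustrative assumptions, not part of the case study:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class AirQualityHBase {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Assumes a table 'air_quality' with column family 'measures' already exists
        HTable table = new HTable(conf, "air_quality");

        // Row key = station id + timestamp (rows are stored in sorted order)
        Put put = new Put(Bytes.toBytes("1-2001-01-01T00"));
        put.add(Bytes.toBytes("measures"), Bytes.toBytes("so2"), Bytes.toBytes("7"));
        table.put(put);

        // Random, real-time read of the same row
        Result row = table.get(new Get(Bytes.toBytes("1-2001-01-01T00")));
        System.out.println(Bytes.toString(
                row.getValue(Bytes.toBytes("measures"), Bytes.toBytes("so2"))));
        table.close();
    }
}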

• Framework for processing large amounts of data in parallel across a distributed cluster
• Loosely inspired by the classic Divide and Conquer (D&C) strategy
• The developer has to implement the Map and Reduce functions:
  o Map: takes the input, partitions it into smaller sub-problems, and distributes them to worker nodes as <K, V> pairs
  o Reduce: collects the <K, List(V)> pairs and generates the results

MapReduce DATA ANALYTICS

BATCH

• Design Patterns
  o Joins: reduce-side join, replicated join, semi-join
  o Sorting: secondary sort, total order sort
  o Filtering
  o Statistics: AVG, VAR, count, …
  o Top-K
  o Binning
  o …

MapReduce

DATA ANALYTICS

BATCH

• Obtain the SO2 average of each station

MapReduce

Station; Title; latitude; longitude; Date; SO2; NO; CO; PM10; O3; dd; vv; TMP; HR; PRB;

"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "8"; "0.35"; "13"; "67"; "158"; "3.87"; "18.8"; "34"; "982";

"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "7"; "0.32"; "16"; "66"; "158"; "4.03"; "19"; "35"; "981"; "23";

"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "6"; "0.26"; "24"; "68"; "158"; "3.76"; "19.1"; "36"; "980"; "23";

"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "6"; "6"; "0.31"; "7"; "67"; "135"; "2.41"; "19.2"; "36"; "981"; "23";

"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "6"; "9"; "0.24"; "24"; "63"; "44"; "1.7"; "15.9"; "62"; "983"; "23";

DATA ANALYTICS

BATCH

[Diagram: the input data is split across several Mappers; each Mapper emits <Station_ID, SO2_value> pairs such as <1, 6>, <1, 2>, <3, 1>, <1, 9>, <3, 9>, <2, 6>, <2, 0>, <2, 8>; shuffling then groups the pairs by Station_ID]

MapReduce DATA ANALYTICS

BATCH

• Maps get records and produce the SO2 value in <Station_Id, SO2_value>

[Diagram: after shuffling, each Reducer receives <Station_ID, [SO2_1, SO2_2, …, SO2_n]>, sums the values and divides by the count, producing Station_ID, AVG_SO2 pairs: 1, 2.013; 2, 2.695; 3, 3.562]

• Reducer receives <Station_Id, List<SO2_value> > and computes the average for the station

MapReduce DATA ANALYTICS

BATCH
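A minimal sketch of this map/reduce pair with the Hadoop Java API, assuming the semicolon-separated records shown above (station id in field 0, SO2 in field 5, surrounding quotes stripped); the driver/job setup is omitted:

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class AvgSO2 {

    // Emits <Station_ID, SO2_value> for every input record
    public static class SO2Mapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().replace("\"", "").split(";");
            if (fields.length > 5) {
                context.write(new Text(fields[0].trim()),
                              new IntWritable(Integer.parseInt(fields[5].trim())));
            }
        }
    }

    // Receives <Station_ID, [SO2_1, SO2_2, ...]> and emits the average per station
    public static class AvgReducer extends Reducer<Text, IntWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            long sum = 0, count = 0;
            for (IntWritable v : values) { sum += v.get(); count++; }
            context.write(key, new DoubleWritable((double) sum / count));
        }
    }
}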

Hive

• Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets
• Abstraction layer on top of MapReduce
• SQL-like language called HiveQL
• Metastore: central repository of Hive metadata

DATA ANALYTICS

BATCH

CREATE TABLE air_quality(Estacion int, Titulo string, latitud double, longitud double, Fecha string, SO2 int, NO int, CO float, …)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ';'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

LOAD DATA INPATH '/CalidadAire_Gijon' OVERWRITE INTO TABLE air_quality;

Hive

• Obtain the SO2 average of each station

SELECT Estacion, Titulo, avg(SO2)
FROM air_quality
GROUP BY Estacion, Titulo;

DATA ANALYTICS

BATCH

• Platform for analyzing large data sets
• High-level language for expressing data analysis programs: Pig Latin, a data flow programming language
• Abstraction layer on top of MapReduce
• Procedural language

Pig DATA ANALYTICS

BATCH

Pig DATA ANALYTICS

BATCH

• Obtain the SO2 average of each station

air_quality = LOAD '/CalidadAire_Gijon' USING PigStorage(';')
    AS (estacion:chararray, titulo:chararray, latitud:chararray,
        longitud:chararray, fecha:chararray, so2:double,
        no:chararray, co:chararray, pm10:chararray, o3:chararray,
        dd:chararray, vv:chararray, tmp:chararray, hr:chararray,
        prb:chararray, rs:chararray, ll:chararray, ben:chararray,
        tol:chararray, mxil:chararray, pm25:chararray);

grouped = GROUP air_quality BY estacion;

avg = FOREACH grouped GENERATE group, AVG(air_quality.so2);

DUMP avg;

• Cascading is a data processing API and query planner used for defining, sharing, and executing data-processing workflows
• Makes development of complex Hadoop MapReduce workflows easier
• Plays a similar role to Pig

DATA ANALYTICS

BATCH

Cascading

// define source and sink Taps (sourceScheme, inputPath, outputPath and
// flowConnector are assumed to be defined earlier)
Tap source = new Hfs( sourceScheme, inputPath );

Scheme sinkScheme = new TextLine( new Fields( "Estacion", "SO2" ) );
Tap sink = new Hfs( sinkScheme, outputPath, SinkMode.REPLACE );

Pipe assembly = new Pipe( "avgSO2" );
assembly = new GroupBy( assembly, new Fields( "Estacion" ) );

// for every Tuple group, compute the SO2 average
Aggregator avg = new Average( new Fields( "SO2" ) );
assembly = new Every( assembly, avg );

// connect the source, sink and assembly into a Flow
Flow flow = flowConnector.connect( "avg-SO2", source, sink, assembly );

// execute the flow, block until complete
flow.complete();

DATA ANALYTICS

BATCH

• Obtain the SO2 average of each station

Cascading

Spark

• Cluster computing system for faster data analytics
• Not a modified version of Hadoop
• Compatible with HDFS
• In-memory data storage for very fast iterative processing
• MapReduce-like engine
• APIs in Scala, Java and Python

DATA ANALYTICS

BATCH
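For comparison with the MapReduce version, the same SO2 average sketched with the Spark Java RDD API (Spark 1.x with Java 8 lambdas); the input path and field positions are the illustrative ones used in the batch examples:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class AvgSO2Spark {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("avg-so2");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> lines = sc.textFile("hdfs:///AirQuality/");

        // (station, (so2, 1)) pairs -> add both components per station -> divide
        JavaPairRDD<String, Tuple2<Double, Long>> pairs = lines.mapToPair(line -> {
            String[] f = line.replace("\"", "").split(";");
            return new Tuple2<String, Tuple2<Double, Long>>(
                    f[0], new Tuple2<Double, Long>(Double.parseDouble(f[5]), 1L));
        });

        pairs.reduceByKey((a, b) -> new Tuple2<Double, Long>(a._1() + b._1(), a._2() + b._2()))
             .mapValues(t -> t._1() / t._2())
             .collect()
             .forEach(t -> System.out.println(t._1() + " -> " + t._2()));

        sc.stop();
    }
}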

Spark DATA ANALYTICS

BATCH

• Hadoop is slow due to replication, serialization and IO tasks

Spark DATA ANALYTICS

BATCH

• 10x-100x faster

Spark SQL

• Large-scale data warehouse system for Spark
• SQL on top of Spark (aka Shark)
• Essentially HiveQL over Spark
• Up to 100x faster than Hive

DATA ANALYTICS

BATCH

Pros
• Faster than the Hadoop ecosystem
• Easier to develop new applications (Scala, Java and Python APIs)

Cons
• Not tested in extremely large clusters yet
• Problems when the reducers' data does not fit in memory

DATA ANALYTICS

BATCH

Spark

1. Big Data processing

2. Batch processing

3. Streaming processing

4. Hybrid computation model

5. Open Issues & Conclusions

Agenda

Real-time processing technologies

DATA ACQUISITION → DATA STORAGE → DATA ANALYSIS → RESULTS

Data acquisition: Flume
Data storage: Kafka, Kestrel
Data analysis: Flume, Storm, Trident, S4, Spark Streaming

Flume DATA ACQUISITION

STREAM

• Kafka is a distributed, partitioned, replicated commit log service

o Producer/Consumer model

o Kafka maintains feeds of messages in categories called topics

o Kafka is run as a cluster

Kafka DATA STORAGE

STREAM

Insert the AirQuality sensor log file into a Kafka cluster and consume the information.

// create a new Producer
Producer<String, String> producer = new Producer<String, String>(config);

// open the sensor log file
BufferedReader br = …

String line;
while (true) {
    line = br.readLine();
    if (line == null)
        … // wait
    else
        producer.send(new KeyedMessage<String, String>(topic, line));
}

Kafka DATA STORAGE

STREAM

AirQuality consumer

ConsumerConnector consumer = Consumer.createJavaConsumerConnector(config);

Map<String, Integer> topicCountMap = new HashMap<String, Integer>();
topicCountMap.put(topic, new Integer(1));

Map<String, List<KafkaMessageStream>> consumerMap = consumer.createMessageStreams(topicCountMap);
KafkaMessageStream stream = consumerMap.get(topic).get(0);

ConsumerIterator it = stream.iterator();
while (it.hasNext()) {
    // consume it.next()
}

Kafka DATA STORAGE

STREAM

• Simple distributed message queue

• A single Kestrel server has a set of queues (strictly-ordered FIFO)

• On a cluster of Kestrel servers, they don’t know about each other and don’t do any cross communication

• Kestrel vs Kafka

o Kafka consumers cheaper (basically just the bandwidth usage)

o Kestrel does not depend on Zookeeper which means it is operationally less complex if you don't already have a zookeeper installation.

o Kafka has significantly better throughput.

o Kestrel does not support ordered consumption

Kestrel DATA STORAGE

STREAM

Interceptor
• Interface org.apache.flume.interceptor.Interceptor
• Can modify or even drop events based on any criteria
• Flume supports chaining of interceptors
• Types:
  o Timestamp interceptor
  o Host interceptor
  o Static interceptor
  o UUID interceptor
  o Morphline interceptor
  o Regex Filtering interceptor
  o Regex Extractor interceptor

DATA ANALYTICS

STREAM

Flume

• The sensors' information must be filtered to keep only "Station 2"
  o An interceptor will filter the information between the Source and the Channel.

Station; Title; latitude; longitude; Date; SO2; NO; CO; PM10; O3; dd; vv; TMP; HR; PRB;

"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "8"; "0.35"; "13"; "67"; "158"; "3.87"; "18.8"; "34"; "982";

"2";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "7"; "0.32"; "16"; "66"; "158"; "4.03"; "19"; "35"; "981"; "23";

"3";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "7"; "6"; "0.26"; "24"; "68"; "158"; "3.76"; "19.1"; "36"; "980"; "23";

"2";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "6"; "6"; "0.31"; "7"; "67"; "135"; "2.41"; "19.2"; "36"; "981"; "23";

"1";"Estación Avenida Constitución";"43.529806";"-5.673428";"2001-01-01"; "6"; "9"; "0.24"; "24"; "63"; "44"; "1.7"; "15.9"; "62"; "983"; "23";

Flume DATA ANALYTICS

STREAM

# Write format can be text or writable

# Defining channel - memory type
…

# Defining source - syslog
…

# Defining sink - HDFS
…

# Defining the interceptor
agent.sources.source.interceptors = i1
agent.sources.source.interceptors.i1.type = org.apache.flume.interceptor.StationFilter

// Pseudocode of the custom interceptor: keep only events from station "2"
class StationFilter implements Interceptor {
    if (station field of the event != "2")
        discard the event;
    else
        keep the event;
}

Flume DATA ANALYTICS

STREAM

Hadoop          Storm
JobTracker      Nimbus
TaskTracker     Supervisor
Job             Topology

• Distributed and scalable real-time computation system
• Doing for real-time processing what Hadoop did for batch processing
• Topology: the processing graph. Each node contains processing logic (spouts and bolts); the links between nodes are streams of data
  o Spout: source of streams. Reads a data source and emits the data into the topology as a stream
  o Bolt: processing unit. Reads data from several streams, does some processing and possibly emits new streams
  o Stream: unbounded sequence of tuples. Tuples can contain any serializable object

Storm DATA ANALYTICS

STREAM

• AirQuality average values
  o Step 1: build the topology

CAReader (Spout) → LineProcessor (Bolt) → AvgValues (Bolt)

Storm

DATA ANALYTICS

STREAM

• AirQuality average values
  o Step 1: build the topology

TopologyBuilder AirAVG = new TopologyBuilder();

AirAVG.setSpout("ca-reader", new CAReader(), 1);

// shuffleGrouping -> even distribution
AirAVG.setBolt("ca-line-processor", new LineProcessor(), 3)
      .shuffleGrouping("ca-reader");

// fieldsGrouping -> tuples with the same field value go to the same task
AirAVG.setBolt("ca-avg-values", new AvgValues(), 2)
      .fieldsGrouping("ca-line-processor", new Fields("id"));

Storm DATA ANALYTICS

STREAM

Storm
• AirQuality average values
  o Step 2: CAReader implementation (IRichSpout interface)

public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
    // initialize the file reader and keep it (and the collector) as fields
    br = new BufferedReader( … );
    this.collector = collector;
}

public void nextTuple() {
    String line = br.readLine();
    if (line == null) {
        return;
    } else {
        collector.emit(new Values(line));
    }
}

DATA ANALYTICS

STREAM

Storm
• AirQuality average values
  o Step 3: LineProcessor implementation (IBasicBolt interface)

public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declare(new Fields("id", "stationName", "lat", …
}

public void execute(Tuple input, BasicOutputCollector collector) {
    collector.emit(new Values((Object[]) input.getString(0).split(";")));
}

DATA ANALYTICS

STREAM

Storm
• AirQuality average values
  o Step 4: AvgValues implementation (IBasicBolt interface)

public void execute(Tuple input, BasicOutputCollector collector) {
    // totals and counts are HashMaps with the accumulated values per station
    if (totals.containsKey(id)) {
        item = totals.get(id);
        count = counts.get(id);
    } else {
        // create a new item
    }
    // update the accumulated values
    item.setSo2(item.getSo2() + Integer.parseInt(input.getStringByField("so2")));
    item.setNo(item.getNo() + Integer.parseInt(input.getStringByField("no")));
    …
}

DATA ANALYTICS

STREAM


• High-level abstraction on top of Storm
  o Provides high-level operations (joins, filters, projections, aggregations, functions…)

Pros
  o Easy, powerful and flexible
  o Incremental topology development
  o Exactly-once semantics

Cons
  o Very few built-in functions
  o Lower performance and higher latency than Storm

Trident DATA ANALYTICS

STREAM
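A minimal sketch of what a Trident topology for the running example might look like. CALogSpout and ParseLine are hypothetical helpers (a spout emitting raw "line" tuples and a function splitting them into station / so2 fields); the example keeps an exactly-once running SO2 sum per station, since an average aggregator would have to be written by hand (sum + count):

import backtype.storm.tuple.Fields;
import storm.trident.TridentTopology;
import storm.trident.operation.builtin.Sum;
import storm.trident.testing.MemoryMapState;

public class AvgSO2Trident {
    // CALogSpout and ParseLine are hypothetical helpers, not part of Trident itself
    public static TridentTopology build() {
        TridentTopology topology = new TridentTopology();
        topology.newStream("ca-sensor", new CALogSpout())
                .each(new Fields("line"), new ParseLine(), new Fields("station", "so2"))
                .groupBy(new Fields("station"))
                // running SO2 sum per station, kept in a (here in-memory) Trident state
                .persistentAggregate(new MemoryMapState.Factory(), new Fields("so2"),
                                     new Sum(), new Fields("so2_sum"));
        return topology;
    }
}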

Simple Scalable Streaming System (S4)

Distributed, scalable, fault-tolerant platform for processing continuous unbounded streams of data

Inspired by the MapReduce and Actor models of computation:
  o Data processing is based on Processing Elements (PEs)
  o Messages are transmitted between PEs in the form of events (Key, Attributes)
  o Processing Nodes are the logical hosts of PEs

S4 DATA ANALYTICS

STREAM

• AirQuality average values

<bean id="split" class="SplitPE">
  <property name="dispatcher" ref="dispatcher"/>
  <property name="keys">
    <!-- Listen for raw sensor log lines -->
    <list>
      <value>LogLines *</value>
    </list>
  </property>
</bean>

<bean id="average" class="AveragePE">
  <property name="keys">
    <list>
      <value>CAItem stationId</value>
    </list>
  </property>
</bean>

S4 DATA ANALYTICS

STREAM

Spark Streaming

• Spark for real-time processing

• Streaming computation as a series of very short batch jobs (windows)

• Keep state in memory

• API similar to Spark

DATA ANALYTICS

STREAM
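A minimal sketch with the Spark Streaming Java API (1.x), treating every 10-second window as a small batch job; the TCP socket source, host name and field positions are illustrative assumptions:

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

public class SO2Streaming {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setAppName("so2-streaming");
        // Each 10-second window is processed as a short batch job
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(10));

        // Illustrative source: sensors pushing semicolon-separated lines to a TCP socket
        JavaDStream<String> lines = ssc.socketTextStream("sensor-gateway", 9999);

        JavaPairDStream<String, Double> so2PerStation = lines.mapToPair(line -> {
            String[] f = line.replace("\"", "").split(";");
            return new Tuple2<String, Double>(f[0], Double.parseDouble(f[5]));
        }).reduceByKey((a, b) -> a + b);

        so2PerStation.print();   // per-batch SO2 sum for every station
        ssc.start();
        ssc.awaitTermination();
    }
}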

1. Big Data processing

2. Batch processing

3. Streaming processing

4. Hybrid computation model

5. Open Issues & Conclusions

Agenda

• We are at the beginning of this generation

• Short-term Big Data processing goal

• Abstraction layer over the Lambda Architecture

• Promising technologies

o SummingBird

o Lambdoop

Hybrid Computation Model

SummingBird

• Library to write MapReduce-like processes that can be executed on Hadoop, Storm or in a hybrid mode
• Scala syntax
• The same logic can be executed in batch, real-time and hybrid batch/real-time mode

HYBRID COMPUTATION MODEL

SummingBird HYBRID COMPUTATION MODEL

Pros
• Hybrid computation model
• Same programming model for all processing paradigms
• Extensible

Cons
• MapReduce-like programming
• Scala
• Not as abstract as some users would like

SummingBird HYBRID COMPUTATION MODEL

Software abstraction layer over Open Source technologies
  o Hadoop, HBase, Sqoop, Flume, Kafka, Storm, Trident

Common patterns and operations (aggregation, filtering, statistics…) already implemented; no need to write MapReduce-like processes

The same single API for the three processing paradigms:
  o Batch processing similar to Pig / Cascading
  o Real-time processing using built-in functions, easier than Trident
  o Hybrid computation model transparent for the developer

Lambdoop HYBRID COMPUTATION MODEL

Lambdoop

Building blocks: Data (static or streaming) → Operation → Data, composed into a Workflow

HYBRID COMPUTATION MODEL

DataInput db_historical = new StaticCSVInput(URI_db);
Data historical = new Data(db_historical);

Workflow batch = new Workflow(historical);

Operation filter = new Filter("Station", "=", 2);
Operation select = new Select("Titulo", "SO2");
Operation group = new Group("Titulo");
Operation average = new Average("SO2");

batch.add(filter);
batch.add(select);
batch.add(group);
batch.add(average);
batch.run();

Data results = batch.getResults();

Lambdoop HYBRID COMPUTATION MODEL

DataInput stream_sensor = new StreamXMLInput(URI_sensor);
Data sensor = new Data(stream_sensor);

Workflow streaming = new Workflow(sensor, new WindowsTime(100));

Operation filter = new Filter("Station", "=", 2);
Operation select = new Select("Titulo", "SO2");
Operation group = new Group("Titulo");
Operation average = new Average("SO2");

streaming.add(filter);
streaming.add(select);
streaming.add(group);
streaming.add(average);
streaming.run();

while (true) {
    Data live_results = streaming.getResults();
    …
}

Lambdoop HYBRID COMPUTATION MODEL

DataInput historical = new StaticCSVInput(URI_folder);
DataInput stream_sensor = new StreamXMLInput(URI_sensor);

Data all_info = new Data(historical, stream_sensor);

Workflow hybrid = new Workflow(all_info, new WindowsTime(1000));

Operation filter = new Filter("Station", "=", 2);
Operation select = new Select("Titulo", "SO2");
Operation group = new Group("Titulo");
Operation average = new Average("SO2");

hybrid.add(filter);
hybrid.add(select);
hybrid.add(group);
hybrid.add(average);
hybrid.run();

Data updated_results = hybrid.getResults();

Lambdoop HYBRID COMPUTATION MODEL

Pros
• High abstraction layer for all processing models
• Covers all steps in the data processing pipeline
• Same Java API for all programming paradigms
• Extensible

Cons
• Ongoing project
• Not open-source yet
• Not tested in large clusters yet

Lambdoop HYBRID COMPUTATION MODEL

1. Big Data processing

2. Batch processing

3. Streaming processing

4. Hybrid computation model

5. Open Issues & Conclusions

Agenda

Open Issues

• Interoperability between well-known techniques / technologies (SQL, R) and Big Data platforms (Hadoop, Spark)

• European technologies (Stratosphere / Apache Flink)

• Massive Streaming Machine Learning

• Real-time Interactive Visual Analytics

• Vertical (domain-driven) solutions

Conclusions

Casado, R. and Younas, M. "Emerging trends and technologies in big data processing." Concurrency and Computation: Practice and Experience, 2014.

Conclusions

• Big Data is not only Hadoop

• Identify the processing requirements of your project

• Analyze the alternatives for all steps in the data pipeline

• The battle for real-time processing is open

• Stay tuned for the hybrid computation model

Thanks for your attention!

Questions?

ruben.casado@treelogic.com

ruben_casado