A real time architecture using Hadoop and Storm @ FOSDEM 2013

Post on 26-Jan-2015

A real-time architecture using Hadoop and Storm.

Speakers

Nathan Bijnens (@nathan_gs)

Geert Van Landeghem (@gvanlandeghem)

Our Vision

Big Data

- Volume
- Velocity
- Variety

Credits

Nathan Marz, engineer at BackType (now Twitter).

- Storm
- Cascalog
- ElephantDB

manning.com/marz

A Data System

Data is more than Information

Not all information is equal. Some information is derived from other pieces of information.

Eventually you will reach the most raw form of information. This is the information you hold true, simply because it exists.

Events

Everything we do generates events:

- Pay with credit card
- Commit to Git
- Click on a webpage
- Tweet

Events - Before

Events were used to manipulate the master data.

Events - After

Today, events are the master data.

Data System

Let's store everything.

Events

- Data is immutable.
- Data is time based.

Capturing change traditionally

Before the update:

Person   Location
Nathan   Antwerp
Geert    Dendermonde
John     Ghent

After the update (Nathan's old location is overwritten):

Person   Location
Nathan   Ghent
Geert    Dendermonde
John     Ghent

Capturing change

Existing rows:

Person   Location      Time
Nathan   Antwerp       2005-01-01
Geert    Dendermonde   2011-10-08
John     Ghent         2010-05-02

After Nathan moves, a row is added and nothing is overwritten:

Person   Location      Time
Nathan   Antwerp       2005-01-01
Geert    Dendermonde   2011-10-08
John     Ghent         2010-05-02
Nathan   Ghent         2013-02-03
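The slides on immutable, time-based data can be illustrated with a small Python sketch. It is only an illustration under assumptions: the `Fact` type and the `record` helper are hypothetical names, and an in-memory list stands in for the master data store.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)  # frozen: a fact cannot be modified once created
class Fact:
    person: str
    location: str
    time: date


# The master data is an append-only list of facts (hypothetical stand-in).
master_data = [
    Fact("Nathan", "Antwerp", date(2005, 1, 1)),
    Fact("Geert", "Dendermonde", date(2011, 10, 8)),
    Fact("John", "Ghent", date(2010, 5, 2)),
]


def record(log, person, location, time):
    """Capture a change by appending a new fact; older facts stay untouched."""
    log.append(Fact(person, location, time))


record(master_data, "Nathan", "Ghent", date(2013, 2, 3))

# Nathan's full history is still available.
history = [f.location for f in master_data if f.person == "Nathan"]
print(history)  # ['Antwerp', 'Ghent']
```

Because nothing is overwritten, the earlier location remains queryable after the move.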

Query

The data you query is often transformed, aggregated, ...

Query = function(data)

Number of people living in each city.

Person   Location      Time
Nathan   Antwerp       2005-01-01
Geert    Dendermonde   2011-10-08
John     Ghent         2010-05-02
Nathan   Ghent         2013-02-03

Result:

Location      Count
Ghent         2
Dendermonde   1
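As a sketch of "Query = function(data)", the count per city can be written as a plain function over all the facts. The function name `people_per_city` and the tuple layout are assumptions made for this example; the rows are the ones from the slide.

```python
from collections import Counter
from datetime import date

# (person, location, time) facts from the slide.
all_data = [
    ("Nathan", "Antwerp", date(2005, 1, 1)),
    ("Geert", "Dendermonde", date(2011, 10, 8)),
    ("John", "Ghent", date(2010, 5, 2)),
    ("Nathan", "Ghent", date(2013, 2, 3)),
]


def people_per_city(facts):
    """Count people per city, using each person's most recent location."""
    latest = {}  # person -> (time, location)
    for person, location, time in facts:
        if person not in latest or time > latest[person][0]:
            latest[person] = (time, location)
    return Counter(location for _, location in latest.values())


print(dict(people_per_city(all_data)))  # {'Ghent': 2, 'Dendermonde': 1}
```

Nathan is counted once, in Ghent, because only his latest fact determines where he lives now.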

Query

[Diagram: All Data -> Query]

Query: Precompute

[Diagram: All Data -> Precomputed View -> Query]
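A rough sketch of the precompute idea: the expensive scan over all data runs once and produces a view, and the query itself becomes a cheap lookup. The names `precompute_view` and `query`, and the sample rows, are hypothetical.

```python
from collections import Counter

# Hypothetical master data: (person, current location) pairs.
all_data = [("Nathan", "Ghent"), ("Geert", "Dendermonde"), ("John", "Ghent")]


def precompute_view(data):
    """Expensive step: scan all data once and build the view."""
    return dict(Counter(location for _, location in data))


def query(view, city):
    """Cheap step: a lookup in the precomputed view."""
    return view.get(city, 0)


view = precompute_view(all_data)
print(query(view, "Ghent"))  # 2
print(query(view, "Antwerp"))  # 0
```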

Layered Architecture

Speed Layer

Batch Layer

Serving Layer

[Diagram: Incoming Data, Hadoop, ElephantDB, Cassandra, Query]

Batch Layer

[Diagram: Incoming Data, Hadoop, ElephantDB]

- Unrestrained computation.
- Horizontally scalable.

- High latency. Let's pretend for now that this does not matter.
- Stores the master copy of the data set, append only.

Batch: View generation

[Diagram: Master Dataset -> MapReduce -> View #1, View #2, View #3]

MapReduce

1. Take a large problem and divide it into sub-problems
2. Perform the same function on all sub-problems
3. Combine the output from all sub-problems

[Diagram: Map: DoWork() DoWork() DoWork() … ; Reduce: Output]
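The three steps can be sketched in a single Python process. This is not Hadoop code: the helpers `split`, `do_work`, and `combine` are hypothetical names, the sample rows are invented, and on a real cluster the sub-problems would run on different machines.

```python
from collections import Counter
from functools import reduce


def split(records, parts):
    """Step 1: divide the input into roughly equal sub-problems."""
    return [records[i::parts] for i in range(parts)]


def do_work(chunk):
    """Step 2: the same function runs on every sub-problem (map)."""
    return Counter(location for _, location in chunk)


def combine(left, right):
    """Step 3: merge two partial results (reduce)."""
    return left + right


# Hypothetical (person, location) records.
records = [
    ("Nathan", "Ghent"),
    ("Geert", "Dendermonde"),
    ("John", "Ghent"),
    ("An", "Antwerp"),
]

partials = [do_work(chunk) for chunk in split(records, 3)]
output = reduce(combine, partials, Counter())
print(output["Ghent"], output["Dendermonde"], output["Antwerp"])  # 2 1 1
```

The combine step works regardless of how the records were split, which is what lets the work be spread over many nodes.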

Batch View Database

Read-only database. No random writes required.

- ElephantDB
- Splout

Batch Layer

[Timeline: data absorbed into Batch Views, followed by data not yet absorbed (just a few hours of data), up to Now]

Speed Layer

Overview

[Diagram: Incoming Data, Hadoop, ElephantDB, Cassandra]

- Stream processing.
- Continuous computation.
- Transactional.
- Stores a limited window of data, compensating for the last few hours of data.

All the complexity is isolated in the Speed layer. If anything goes wrong, it is auto-corrected.
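A minimal sketch of what a speed layer view does, assuming a simple count per city: each incoming event updates the view incrementally, and the view only has to cover data the batch layer has not absorbed yet. The `RealTimeView` class and its `reset` hook are hypothetical; a real system would more likely expire entries by time window than clear everything at once.

```python
from collections import Counter


class RealTimeView:
    """Counts only events that the batch layer has not absorbed yet."""

    def __init__(self):
        self.counts = Counter()

    def update(self, event):
        """Incremental update: touch one counter per incoming event."""
        self.counts[event["city"]] += 1

    def reset(self):
        """Hypothetical hook, called once a batch run has absorbed these events."""
        self.counts.clear()


view = RealTimeView()
for event in [{"city": "Ghent"}, {"city": "Antwerp"}, {"city": "Ghent"}]:
    view.update(event)

print(view.counts["Ghent"])  # 2
view.reset()
print(view.counts["Ghent"])  # 0
```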

CAP

You have a choice between:

- Availability: queries are eventually consistent.
- Consistency: queries are consistent.

Eventual accuracy

Some algorithms are hard to implement in real time. For those cases we could estimate the results.
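One example of estimating a result is counting distinct items in a stream. The sketch below uses a K-minimum-values estimator, which is a technique chosen for this illustration and not one named in the talk; the function name `estimate_distinct` and the parameter `k` are assumptions. It keeps only the `k` smallest hash values, so memory stays bounded no matter how long the stream is.

```python
import hashlib
import heapq


def estimate_distinct(items, k=256):
    """Estimate the number of distinct items, keeping only k hash values."""
    smallest = set()  # the k smallest distinct hash values seen so far
    heap = []  # max-heap (negated values) mirroring `smallest`
    for item in items:
        digest = hashlib.sha1(str(item).encode()).digest()
        value = int.from_bytes(digest[:8], "big") / 2**64  # roughly uniform in [0, 1)
        if value in smallest:
            continue
        if len(heap) < k:
            smallest.add(value)
            heapq.heappush(heap, -value)
        elif value < -heap[0]:
            evicted = -heapq.heappushpop(heap, -value)
            smallest.discard(evicted)
            smallest.add(value)
    if len(heap) < k:
        return len(heap)  # fewer than k distinct values: the count is exact
    return int((k - 1) / -heap[0])


# 50,000 events from 10,000 distinct (hypothetical) users.
visitors = [f"user-{i % 10_000}" for i in range(50_000)]
print(estimate_distinct(visitors))  # close to 10000, but not exact
```

In the layered design described here, the exact figure would later come from the batch layer, so the estimate only has to be good enough for the most recent data.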

Speed Layer

[Diagram: Incoming Data feeding Real Time View 1 and Real Time View 2]

Storm

- Message passing.
- Distributed processing.
- Horizontally scalable.
- Incremental algorithms.
- Fast.
- Data in motion.

Storm

[Diagram: Nimbus, ZooKeeper, and three Worker Nodes, each running a Supervisor and three Workers]

Storm

- Tuple
- Stream
- Spout
- Bolt
- Grouping
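To make these terms concrete, here is a toy, single-process imitation in Python. It is not Storm's API (Storm topologies are normally defined through its own interfaces); every name below is hypothetical. A spout emits a stream of tuples, bolts consume tuples and may emit new ones, and a grouping decides which task of a bolt receives each tuple.

```python
from collections import Counter


def sentence_spout():
    """Spout: a source that emits a stream of tuples."""
    for sentence in ["storm is fast", "hadoop is batch", "storm is real time"]:
        yield (sentence,)


def split_bolt(tup):
    """Bolt: consumes one tuple and emits zero or more new tuples."""
    (sentence,) = tup
    for word in sentence.split():
        yield (word,)


class CountBolt:
    """Bolt with state: keeps a running count per word."""

    def __init__(self):
        self.counts = Counter()

    def execute(self, tup):
        (word,) = tup
        self.counts[word] += 1


def fields_grouping(tup, num_tasks):
    """Grouping: route tuples with the same field value to the same task."""
    return sum(tup[0].encode()) % num_tasks  # simple deterministic hash


count_tasks = [CountBolt(), CountBolt()]
for sentence_tuple in sentence_spout():
    for word_tuple in split_bolt(sentence_tuple):
        task = fields_grouping(word_tuple, len(count_tasks))
        count_tasks[task].execute(word_tuple)

total = sum((t.counts for t in count_tasks), Counter())
print(total["storm"], total["is"])  # 2 3
```

Because the grouping sends every occurrence of a word to the same task, each task can keep its counts locally without coordinating with the others.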

Speed Layer Views

The views are stored in a read & write database:

- Cassandra
- HBase
- MongoDB
- MySQL
- ElasticSearch

Much more complex than a read-only view.

Serving Layer

Overview

[Diagram: Incoming Data, Hadoop, ElephantDB, Cassandra, Query]

This layer queries the Batch & Real Time views and merges them.

[Diagram: Real Time Views and Batch Views feed into Merge]
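A sketch of the merge step for a count-style view, assuming both views are keyed by city. The function name and the numbers are hypothetical; summing is only the right merge for additive results such as counts, and other views would need their own merge function.

```python
def merged_count(batch_view, realtime_view, city):
    """Answer a query by combining both views for the same key."""
    return batch_view.get(city, 0) + realtime_view.get(city, 0)


# Hypothetical view contents: the batch view covers everything up to the
# last batch run, the real time view only what arrived since then.
batch_view = {"Ghent": 1200, "Antwerp": 3400}
realtime_view = {"Ghent": 7, "Dendermonde": 2}

print(merged_count(batch_view, realtime_view, "Ghent"))  # 1207
print(merged_count(batch_view, realtime_view, "Dendermonde"))  # 2
```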

Overview

[Diagram: Incoming Data, Hadoop, ElephantDB, Cassandra, Query]

Lambda Architecture

You can discard any view, batch or real time, and recreate everything from the master data.

Mistakes are corrected via recomputation:

- Wrote bad data? Remove the data & recompute.
- Bug in view generation? Just recompute the view.

Data storage is highly optimized.
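Correction by recomputation can be sketched as follows, assuming view generation is a pure function of the master data. The `build_view` name and the sample rows, including the deliberately bad entry, are hypothetical.

```python
from collections import Counter


def build_view(master_data):
    """View generation is a pure function of the master data."""
    return dict(Counter(city for _, city in master_data))


master_data = [("Nathan", "Ghent"), ("Geert", "Dendermonde"), ("John", "Gent?")]
view = build_view(master_data)  # contains the bad key "Gent?"

# Wrote bad data? Remove it from the master data and recompute the view.
master_data = [row for row in master_data if row != ("John", "Gent?")]
view = build_view(master_data)
print(view)  # {'Ghent': 1, 'Dendermonde': 1}
```

The old view is simply thrown away; nothing in it has to be patched by hand.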

Recommendations

Serialization & Schema

Catch errors as soon as they happen: validate on write rather than on read.

CSV is actually a serialization language that is just poorly defined.

Use a format with a schema:

- Thrift
- Avro
- Protocol Buffers
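Validation on write can be sketched without any of these libraries. The example below uses a plain Python dataclass as a stand-in for a Thrift or Avro type and JSON as the wire format; both are assumptions made to keep the sketch self-contained, and the `LocationEvent` schema is hypothetical.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class LocationEvent:
    """Hypothetical schema for one event; stands in for a Thrift or Avro type."""

    person: str
    location: str
    timestamp: int  # seconds since the epoch

    def __post_init__(self):
        # Validation on write: reject bad records before they are stored.
        if not isinstance(self.person, str) or not self.person:
            raise ValueError("person must be a non-empty string")
        if not isinstance(self.location, str) or not self.location:
            raise ValueError("location must be a non-empty string")
        if not isinstance(self.timestamp, int) or self.timestamp < 0:
            raise ValueError("timestamp must be a non-negative integer")


def serialize(event):
    """Serialize a validated event (JSON here, only for illustration)."""
    return json.dumps(asdict(event))


print(serialize(LocationEvent("Nathan", "Ghent", 1359849600)))

try:
    LocationEvent("Nathan", "Ghent", "2013-02-03")  # wrong type for timestamp
except ValueError as error:
    print("rejected:", error)
```

A malformed record fails at the moment it is created, so it never reaches the master data and no reader has to cope with it later.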

Questions?

What are your needs?

@nathan_gs & @gvanlandeghem

DataCrunchers

We enable companies in envisioning, defining and implementing a data strategy.

A one-stop-shop for all your Big Data needs.

The first Big Data Consultancy agency in Belgium.

Jobs

We are hiring: jobs@datacrunchers.eu