Realtime Distributed Analysis of Datastreams

Posted on 10-May-2015

description

A talk by Philipp Nolte from the advanced seminar "Personalisierung mit großen Daten" (Personalization with Big Data).

transcript

Realtime Distributed Analysis of Datastreams

Philipp Nolte – University of Passau – January 2014

1

Learn

Why we need fancy Big Data frameworks.

What the lambda architecture looks like.

How Twitter used to do real-time analytics.

Why Twitter created Storm.

How Storm works.

2

Limits

Imagine a traditional web analytics software:

Every page view increments the URL’s database row.

3

First Aid

Queue your writes and write in batches.

Shard your data: Partition horizontally.
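The two remedies above can be sketched in a few lines of Python. This is only an illustration under assumptions: the class `BatchedCounter`, the helper `shard_for`, the shard count, and the batch size are hypothetical names and values, and in-memory counters stand in for real database shards.

```python
import zlib
from collections import Counter

NUM_SHARDS = 4  # hypothetical shard count


def shard_for(url, num_shards=NUM_SHARDS):
    """Pick a shard from a stable hash of the URL (horizontal partitioning)."""
    return zlib.crc32(url.encode("utf-8")) % num_shards


class BatchedCounter:
    """Queues page views in memory and writes them to the shards in batches."""

    def __init__(self, batch_size=3, num_shards=NUM_SHARDS):
        self.batch_size = batch_size
        self.num_shards = num_shards
        self.pending = []  # queued page views, not yet written
        # In-memory stand-ins for the database shards (an assumption).
        self.shards = [Counter() for _ in range(num_shards)]
        self.writes = 0  # number of batch writes issued so far

    def record(self, url):
        """Queue one page view; flush once the batch is full."""
        self.pending.append(url)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        """Write all queued views at once, one increment per distinct URL."""
        if not self.pending:
            return
        for url, n in Counter(self.pending).items():
            self.shards[shard_for(url, self.num_shards)][url] += n
        self.pending.clear()
        self.writes += 1

    def views(self, url):
        """Read the written count for a URL from the shard that owns it."""
        return self.shards[shard_for(url, self.num_shards)][url]
```

Note that views still sitting in the queue are not visible to `views` until the next flush, which is the latency trade-off batching introduces.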

4

Chronic Issues

Fault-tolerance is hard.

Applications become more and more complex.

You have to do all the work.

5

New Tools

Large scale computation systems such as Hadoop.

Scalable databases such as Cassandra and Riak.

Easy to use frameworks such as Storm and Dempsy.

6

Lambda Architecture

Speed Layer

Serving Layer

Batch Layer

Theoretical, abstract architecture for working with big data.

7

Goal

Compute arbitrary functions on arbitrary data.

query = function ( all data )
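Read literally, the equation says that every query is a pure function over the complete dataset. A minimal Python sketch of that idea follows; the dataset `ALL_DATA` and the two query functions are hypothetical examples, not part of the talk.

```python
from collections import Counter

# Hypothetical master dataset: one immutable record per page view.
ALL_DATA = (
    {"url": "/home", "user": "alice"},
    {"url": "/about", "user": "bob"},
    {"url": "/home", "user": "bob"},
)


def page_views(all_data):
    """Query 1: page views per URL, computed from all data."""
    return Counter(record["url"] for record in all_data)


def unique_visitors(all_data, url):
    """Query 2: distinct users that viewed a URL, computed from all data."""
    return len({r["user"] for r in all_data if r["url"] == url})
```

Running such functions over all data on every request is too slow at scale, which is the motivation for precomputing views in the layers that follow.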

8

Properties

Robust and fault-tolerant.

Low latency reads and updates.

Scalable.

Minimal maintenance.

9

Batch Layer

Stores the immutable master dataset.

Precomputes arbitrary batch views.

Home of batch processing and MapReduce systems such as Hadoop.
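As a rough illustration of precomputing a batch view in the map/reduce style, here is a single-process Python sketch. It does not use Hadoop; the function names and the record layout are assumptions made for the example.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit a (key, value) pair for each page-view record."""
    yield record["url"], 1


def reduce_phase(key, values):
    """Reduce: combine all values emitted for one key."""
    return key, sum(values)


def precompute_batch_view(master_dataset):
    """Recompute the view from scratch over the whole master dataset."""
    groups = defaultdict(list)
    for record in master_dataset:  # the dataset is only read, never modified
        for key, value in map_phase(record):
            groups[key].append(value)
    return dict(reduce_phase(k, v) for k, v in groups.items())
```

Because the view is always rebuilt from the immutable master dataset, a bug in the view logic can be fixed by correcting the function and recomputing.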


10

Serving Layer

Read-only random-access to batch views.

Updated by batch layer.

Indexes batch views.

Home of real-time query systems such as Cloudera Impala for Hadoop.


11

Speed Layer

Compensates for high-latency batch views.

Fast, incremental algorithms.

More complex because of random writes.

Home of Apache HBase or Storm.
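The contrast with batch recomputation can be shown with a small incremental counter. This is a plain-Python sketch, not HBase or Storm code; the class name `RealtimeView` and its methods are hypothetical.

```python
class RealtimeView:
    """Keeps a view up to date by applying each new record as it arrives."""

    def __init__(self):
        self.counts = {}  # mutable state, updated in place (random writes)

    def update(self, record):
        """Incremental step: touch only the affected key, recompute nothing."""
        url = record["url"]
        self.counts[url] = self.counts.get(url, 0) + 1

    def get(self, url):
        """Read the current count; unseen URLs count as zero."""
        return self.counts.get(url, 0)
```

Each update is cheap, but the view now depends on mutable state that has to be kept correct under failures, which is where the extra complexity comes from.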


12

Lambda Architecture

[Diagram: incoming Data feeds the Batch Layer and the Speed Layer. The Serving Layer exposes Batch Views, the Speed Layer exposes Realtime Views, and a Query reads both.]

13

Available Data

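[Diagram: along a time axis, the available data is covered first by a batch view and then by a realtime view, shown for two successive batch runs.]

Discard a realtime view as soon as its data is represented in the batch view.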
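One way to picture the hand-over between the two views is the following Python sketch. It assumes that events carry a timestamp and that a batch run covers everything up to a cutoff; `build_view`, `split_at`, and `query` are hypothetical helpers for illustration.

```python
from collections import Counter


def build_view(events):
    """Count page views per URL from (timestamp, url) events."""
    return Counter(url for _, url in events)


def split_at(events, cutoff):
    """Separate events a batch run has absorbed from the more recent ones."""
    absorbed = [e for e in events if e[0] <= cutoff]
    recent = [e for e in events if e[0] > cutoff]
    return absorbed, recent


def query(url, batch_view, realtime_view):
    """Answer a query by merging the batch view with the realtime view."""
    return batch_view[url] + realtime_view[url]
```

When a later batch run moves the cutoff forward, the realtime view is rebuilt from the remaining recent events only; the merged answer stays the same while the realtime part shrinks.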

14

Twitter’s Early Days

[Diagram: a pipeline of queues and workers processing tweets. Labels: Tweets, Queue (6×), Worker (8×), Map, URLs, Hadoop, Cassandra.]

15

Storm

Guaranteed message processing without message brokers.

Horizontal scalability.

Fault-tolerance.

High level of abstraction.

Just works.

16

Storm Topologies

[Diagram: a topology of two Spouts and four Bolts, connected by Streams.]
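The roles of spouts, bolts, and streams can be mimicked with Python generators. This is an analogy only and does not use Storm's API; the word-count pipeline and all function names are assumptions chosen for the example.

```python
def sentence_spout():
    """Spout: the source of a stream (here a fixed, hypothetical list)."""
    yield from ["storm processes streams", "streams of tuples"]


def split_bolt(stream):
    """Bolt: consumes a stream of sentences and emits a stream of words."""
    for sentence in stream:
        for word in sentence.split():
            yield word


def count_bolt(stream):
    """Bolt: keeps a running count and emits (word, count) after each word."""
    counts = {}
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
        yield word, counts[word]


def run_topology():
    """Wire spout and bolts into a linear topology and drain it."""
    return list(count_bolt(split_bolt(sentence_spout())))
```

A real topology is a graph rather than a chain, and its streams are unbounded, so it keeps running instead of returning a list.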

17

Parallel Tasks

[Diagram: the same topology of two Spouts and four Bolts; each Spout and Bolt runs as several parallel Tasks (T).]
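How one bolt might be spread over several tasks can be sketched as follows. The slide does not say how tuples are distributed among tasks; hashing on the word, so that equal words always reach the same task, is an assumption made here, and `CountTask`, `route`, and `run` are hypothetical names.

```python
import zlib


class CountTask:
    """One task of a counting bolt; it holds only its own share of the state."""

    def __init__(self):
        self.counts = {}

    def execute(self, word):
        self.counts[word] = self.counts.get(word, 0) + 1


def route(word, num_tasks):
    """Choose a task from a stable hash so equal words share one task."""
    return zlib.crc32(word.encode("utf-8")) % num_tasks


def run(words, num_tasks=3):
    """Distribute words over the tasks (sequentially here, for simplicity)."""
    tasks = [CountTask() for _ in range(num_tasks)]
    for word in words:
        tasks[route(word, num_tasks)].execute(word)
    return tasks
```

Since each word is owned by exactly one task, the per-task counts can be combined without double counting.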

18

Demo

Storm in action

19

Know

Why we need fancy Big Data frameworks.

What the lambda architecture looks like.

How Twitter used to do real-time analytics.

Why Twitter created Storm.

How Storm works.

20

The End.

Questions?

21