Storm: overview
distributed and fault-tolerant realtime computation
Backend Web Berlin
Storm: www.storm-project.net
Storm is a free and open source distributed realtime computation system.
September BWB Meetup
Use cases
• stream processing
• continuous computations
• distributed RPC
Overview
• free and open source
• integrates with any queuing and database system
• distributed and scalable
• fault-tolerant
• supports multiple languages
Scalable
Storm topologies are inherently parallel and run across a cluster of machines.
Different parts of the topology can be scaled individually by tweaking their
parallelism.
The "rebalance" command of the "storm" command line client can adjust the
parallelism of running topologies on the fly.
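For example, rebalancing might look like this (the topology and component names here are illustrative; `-w` is the wait time in seconds, `-n` the new worker count, and `-e` a component's new executor count):

```shell
# Redistribute the running topology across 4 workers and
# raise BoltC's parallelism to 8 executors, on the fly.
storm rebalance demoTopology -w 10 -n 4 -e BoltC=8
```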
Fault tolerant
When workers die, Storm will automatically restart them.
If a node dies, its workers will be restarted on another node.
The Storm daemons, Nimbus and the Supervisors, are designed to be stateless
and fail-fast.
Guarantees data processing
Storm guarantees every tuple will be fully processed. One of Storm's core
mechanisms is the ability to track the lineage of a tuple as it makes its way
through the topology in an extremely efficient way.
Messages are only replayed when there are failures. Storm's basic abstractions
provide an at-least-once processing guarantee, the same guarantee you get
when using a queueing system.
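The tuple-tracking mechanism behind this guarantee is an XOR trick: the acker keeps a single 64-bit value per spout tuple, XORs in the id of every tuple emitted into the tree, and XORs it out again on every ack, so the value returns to zero exactly when the whole tree is acked. A stdlib-only toy model (class and method names are illustrative, not Storm's API):

```java
import java.util.Random;

// Toy model of Storm's acker. Every edge of a tuple tree gets a random
// 64-bit id; each anchored emit XORs the id in, each ack XORs it out.
// Since every id is XORed exactly twice, the running value is zero
// if and only if the whole tree has been acked.
public class AckerSketch {
    private long ackVal = 0L;

    public void emit(long edgeId) { ackVal ^= edgeId; }
    public void ack(long edgeId)  { ackVal ^= edgeId; }
    public boolean fullyProcessed() { return ackVal == 0L; }

    public static boolean demo() {
        AckerSketch acker = new AckerSketch();
        Random rnd = new Random();
        long a = rnd.nextLong(), b = rnd.nextLong(), c = rnd.nextLong();
        acker.emit(a);               // spout emits the root tuple
        acker.emit(b); acker.ack(a); // a bolt emits b anchored on a, acks a
        acker.emit(c); acker.ack(b); // next bolt emits c anchored on b, acks b
        acker.ack(c);                // leaf bolt acks c: tree complete
        return acker.fullyProcessed();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // true
    }
}
```

This is why the tracking is so cheap: memory per spout tuple is constant regardless of how large the tuple tree grows.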
Use with many languages
Storm was designed from the ground up to be usable with any programming
language.
Spouts and bolts can be defined in any language. Non-JVM spouts and bolts communicate with Storm over a JSON-based protocol on stdin/stdout.
Adapters that implement this protocol exist for Ruby, Python, JavaScript, Perl, and PHP.
How Storm works: Storm cluster
(Diagram: a Nimbus node and five Supervisor nodes coordinating through a three-node ZooKeeper ensemble.)
How Storm works: basic concepts
Topology: a graph of computation. A topology runs forever, or until you kill it.
Stream: an unbounded sequence of tuples.
Spout: a source of streams.
Bolt: where calculations are done. Bolts can do anything: run functions, filter tuples, do streaming aggregations and joins, talk to databases, etc.
How Storm works: basic concepts
Worker process: executes a subset of a topology. A worker process belongs to a specific topology and may run one or more executors for one or more components (spouts or bolts) of that topology.
Executor: a thread spawned by a worker process. It may run one or more tasks for the same component, all on its single thread.
Task: performs the actual data processing; each spout or bolt that you implement executes as a set of tasks across the cluster. The number of tasks for a component stays the same throughout the lifetime of a topology.
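The executor/task relationship above can be sketched with plain threads: one executor thread services all of its tasks sequentially, which is why a component can have more tasks than executors. A stdlib-only sketch (names are illustrative, not Storm's API):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// One "executor" thread services several "tasks" of the same component,
// mirroring Storm's executor/task split: tasks share the executor's
// single thread, so task count may exceed thread count.
public class ExecutorSketch {
    interface Task { String process(String item); }

    // Run all items through this executor's tasks on one thread,
    // round-robining the incoming items over the tasks.
    static List<String> runExecutor(List<Task> tasks, List<String> items)
            throws InterruptedException {
        List<String> results = new ArrayList<>();
        Thread executor = new Thread(() -> {
            int i = 0;
            for (String item : items) {
                results.add(tasks.get(i % tasks.size()).process(item));
                i++;
            }
        });
        executor.start();
        executor.join(); // join() makes the results safely visible here
        return results;
    }

    public static List<String> demo() {
        try {
            List<Task> tasks = Arrays.asList(
                item -> "task1:" + item,
                item -> "task2:" + item);
            return runExecutor(tasks, Arrays.asList("a", "b", "c"));
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [task1:a, task2:b, task1:c]
    }
}
```

In real Storm the per-task routing is decided by the stream grouping, not a simple round-robin, but the threading model is the same.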
How Storm works: basic concepts
(Diagram: an example topology. A Spout with 2 tasks feeds BoltA (3 tasks) and BoltB (2 tasks); BoltA feeds BoltC (6 tasks); BoltB and BoltC feed BoltD (3 tasks); BoltD feeds BoltE (2 tasks) and BoltF (1 task).)
How Storm works: topology example

class DemoTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // Output streams are declared in each component's declareOutputFields():
        // the spout declares a default stream ("uid", "item") and a stream
        // "item_copy" with the same fields; BoltB and BoltC declare default
        // streams ("uid", "fromB") and ("uid", "fromC"); BoltD declares streams
        // "forE" ("uid", "text") and "forF" ("uid", "text", "ne").
        builder.setSpout("Spout", new DemoSpout(), 2).setNumTasks(2);
        builder.setBolt("BoltA", new BoltA(), 2).setNumTasks(3)
               .shuffleGrouping("Spout", "item_copy");
        builder.setBolt("BoltB", new BoltB(), 2).setNumTasks(2)
               .shuffleGrouping("Spout");
        builder.setBolt("BoltC", new BoltC(), 2).setNumTasks(6)
               .shuffleGrouping("BoltA");
        builder.setBolt("BoltD", new BoltD(), 3).setNumTasks(3)
               .fieldsGrouping("BoltC", new Fields("uid"))
               .fieldsGrouping("BoltB", new Fields("uid"));
        builder.setBolt("BoltE", new BoltE(), 1).setNumTasks(2)
               .shuffleGrouping("BoltD", "forE");
        builder.setBolt("BoltF", new BoltF(), 1).setNumTasks(1)
               .shuffleGrouping("BoltD", "forF");

        Config conf = new Config();
        StormSubmitter.submitTopology("demoTopology", conf, builder.createTopology());
    }
}
How Storm works: spout example

public class DemoSpout extends BaseRichSpout {
    private SpoutOutputCollector _collector;
    private Queue<String> _queue;

    @Override
    public void open(Map conf, TopologyContext context,
                     SpoutOutputCollector collector) {
        _collector = collector;
        _queue = new ConcurrentLinkedQueue<String>(); // or your favourite queue
    }

    @Override
    public void nextTuple() {
        String nextItem = _queue.poll();
        if (nextItem != null) {
            _collector.emit(new Values(nextItem));
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("item"));
    }
}
How Storm works: bolt example

public class BoltA extends BaseRichBolt {
    private OutputCollector _collector;

    @Override
    public void prepare(Map conf, TopologyContext context,
                        OutputCollector collector) {
        _collector = collector;
    }

    @Override
    public void execute(Tuple tuple) {
        String item = tuple.getString(0);
        String capitalizedItem = capitalize(item);
        // anchor the new tuple on the input, then ack the input
        _collector.emit(tuple, new Values(capitalizedItem));
        _collector.ack(tuple);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("item"));
    }
}
Storm UI
Read more about Storm
• Storm: http://storm-project.net/
• Example Storm topologies: https://github.com/nathanmarz/storm-starter
• Implementing Real-Time Trending Topics With a Distributed Rolling Count Algorithm: http://www.michael-noll.com/blog/2013/01/18/implementing-real-time-trending-topics-in-storm/
• Understanding the Internal Message Buffers of Storm: http://www.michael-noll.com/blog/2013/06/21/understanding-storm-internal-message-buffers/
• Understanding the Parallelism of a Storm Topology: http://www.michael-noll.com/blog/2012/10/16/understanding-the-parallelism-of-a-storm-topology/
Storm in our company
Ferret Go GmbH: Trend & Media Analytics
ferret-go.com
Our data flow (simplified)
(Diagram: sources such as Google+, blogs, comments, online media, offline media, and reviews flow through processing, classification, and analyzing stages into ElasticSearch.)
Problem overview
• we have a number of streams (e.g. Google+, Twitter, Facebook) that spout items
• for every item we perform different calculations
• at the end of the calculations we save the item into storage(s): ElasticSearch, PostgreSQL, etc.
• if processing fails because of environment issues, we want to re-queue the item easily
• some of our calculations can be done in parallel
Solution
• Redis-based queues for spouting
• 1-2 spouts per topology
• 1 bulk bolt per worker for storage writing
• Storm cluster with 2 nodes: 32 GB RAM, 4-core i7 CPU, Java 7, Ubuntu 12.04
• ~20 items per second (could be increased)
• 3 worker slots per node, 198 tasks, 68 executors
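The "bulk bolt" above buffers incoming items and writes them to storage in batches rather than one at a time. A stdlib-only sketch of that buffering logic (names are illustrative; the Consumer stands in for an ElasticSearch or PostgreSQL bulk write):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Buffers incoming items and flushes them in bulk once the batch is
// full: the pattern behind "1 bulk bolt for storage writing per worker".
public class BulkWriterSketch {
    private final List<String> buffer = new ArrayList<>();
    private final int batchSize;
    private final Consumer<List<String>> storage; // stand-in for ES/PostgreSQL

    public BulkWriterSketch(int batchSize, Consumer<List<String>> storage) {
        this.batchSize = batchSize;
        this.storage = storage;
    }

    public void write(String item) {
        buffer.add(item);
        if (buffer.size() >= batchSize) flush();
    }

    // In a real bolt you would also flush on tick tuples or shutdown,
    // so a partially filled batch is never stranded.
    public void flush() {
        if (!buffer.isEmpty()) {
            storage.accept(new ArrayList<>(buffer));
            buffer.clear();
        }
    }

    public static int demo() {
        List<List<String>> batches = new ArrayList<>();
        BulkWriterSketch writer = new BulkWriterSketch(2, batches::add);
        writer.write("a"); writer.write("b"); writer.write("c");
        writer.flush();
        return batches.size(); // 2 batches: [a, b] and [c]
    }

    public static void main(String[] args) {
        System.out.println(demo()); // 2
    }
}
```

Batching amortizes the per-request overhead of the storage backend, which is what makes the ~20 items/sec figure easy headroom to increase.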
Thank you!
30.09.2013
September BWB Meetup
Andrii Gakhov