Storm: overview
distributed and fault-tolerant realtime computation
Backend Web Berlin
Storm: www.storm-project.net
Storm is a free and open source distributed realtime computation system.
September BWB Meetup
Use cases
• stream processing
• continuous computations
• distributed RPC
Overview
• free and open source
• integrates with any queuing and database system
• distributed and scalable
• fault-tolerant
• supports multiple languages
Scalable
Storm topologies are inherently parallel and run across a cluster of machines.
Different parts of the topology can be scaled individually by tweaking their
parallelism.
The "rebalance" command of the "storm" command line client can adjust the
parallelism of running topologies on the fly.
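For example, rebalancing might look like this (the topology and component names here are illustrative; `-w` is the wait time in seconds, `-n` the new worker count, and `-e` a component's new executor count):

```shell
# Redistribute the running topology across 4 workers and
# raise BoltC's parallelism to 8 executors, on the fly.
storm rebalance demoTopology -w 10 -n 4 -e BoltC=8
```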
Fault tolerant
When workers die, Storm will automatically restart them.
If a node dies, its workers will be restarted on another node.
The Storm daemons, Nimbus and the Supervisors, are designed to be stateless
and fail-fast.
Guarantees data processing
Storm guarantees every tuple will be fully processed. One of Storm's core
mechanisms is the ability to track the lineage of a tuple as it makes its way
through the topology in an extremely efficient way.
Messages are only replayed when there are failures. Storm's basic abstractions
provide an at-least-once processing guarantee, the same guarantee you get
when using a queueing system.
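The tuple-tracking mechanism behind this guarantee is an XOR trick: the acker keeps a single 64-bit value per spout tuple, XORs in the id of every tuple emitted into the tree, and XORs it out again on every ack, so the value returns to zero exactly when the whole tree is acked. A stdlib-only toy model (class and method names are illustrative, not Storm's API):

```java
import java.util.Random;

// Toy model of Storm's acker. Every edge of a tuple tree gets a random
// 64-bit id; each anchored emit XORs the id in, each ack XORs it out.
// Since every id is XORed exactly twice, the running value is zero
// if and only if the whole tree has been acked.
public class AckerSketch {
    private long ackVal = 0L;

    public void emit(long edgeId) { ackVal ^= edgeId; }
    public void ack(long edgeId)  { ackVal ^= edgeId; }
    public boolean fullyProcessed() { return ackVal == 0L; }

    public static boolean demo() {
        AckerSketch acker = new AckerSketch();
        Random rnd = new Random();
        long a = rnd.nextLong(), b = rnd.nextLong(), c = rnd.nextLong();
        acker.emit(a);               // spout emits the root tuple
        acker.emit(b); acker.ack(a); // a bolt emits b anchored on a, acks a
        acker.emit(c); acker.ack(b); // next bolt emits c anchored on b, acks b
        acker.ack(c);                // leaf bolt acks c: tree complete
        return acker.fullyProcessed();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // true
    }
}
```

This is why the tracking is so cheap: memory per spout tuple is constant regardless of how large the tuple tree grows.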
Use with many languages
Storm was designed from the ground up to be usable with any programming
language.
Spouts and bolts can be defined in any language. Non-JVM spouts and bolts communicate with Storm over a JSON-based protocol on stdin/stdout.
Adapters that implement this protocol exist for Ruby, Python, JavaScript, Perl, and PHP.
How Storm works: Storm cluster
(Diagram: a Nimbus node and five Supervisor nodes coordinating through a three-node ZooKeeper ensemble.)
How Storm works: basic concepts
Topology: a graph of computation. A topology runs forever, or until you kill it.
Stream: an unbounded sequence of tuples.
Spout: a source of streams.
Bolt: where calculations are done. Bolts can do anything: run functions, filter tuples, do streaming aggregations and joins, talk to databases, etc.
How Storm works: basic concepts
Worker process: executes a subset of a topology. A worker process belongs to a specific topology and may run one or more executors for one or more components (spouts or bolts) of that topology.
Executor: a thread spawned by a worker process. It may run one or more tasks for the same component, all on its single thread.
Task: performs the actual data processing; each spout or bolt that you implement executes as a set of tasks across the cluster. The number of tasks for a component stays the same throughout the lifetime of a topology.
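The executor/task relationship above can be sketched with plain threads: one executor thread services all of its tasks sequentially, which is why a component can have more tasks than executors. A stdlib-only sketch (names are illustrative, not Storm's API):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// One "executor" thread services several "tasks" of the same component,
// mirroring Storm's executor/task split: tasks share the executor's
// single thread, so task count may exceed thread count.
public class ExecutorSketch {
    interface Task { String process(String item); }

    // Run all items through this executor's tasks on one thread,
    // round-robining the incoming items over the tasks.
    static List<String> runExecutor(List<Task> tasks, List<String> items)
            throws InterruptedException {
        List<String> results = new ArrayList<>();
        Thread executor = new Thread(() -> {
            int i = 0;
            for (String item : items) {
                results.add(tasks.get(i % tasks.size()).process(item));
                i++;
            }
        });
        executor.start();
        executor.join(); // join() makes the results safely visible here
        return results;
    }

    public static List<String> demo() {
        try {
            List<Task> tasks = Arrays.asList(
                item -> "task1:" + item,
                item -> "task2:" + item);
            return runExecutor(tasks, Arrays.asList("a", "b", "c"));
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [task1:a, task2:b, task1:c]
    }
}
```

In real Storm the per-task routing is decided by the stream grouping, not a simple round-robin, but the threading model is the same.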
How Storm works: basic concepts
(Diagram: an example topology. A Spout with 2 tasks feeds BoltA (3 tasks) and BoltB (2 tasks); BoltA feeds BoltC (6 tasks); BoltB and BoltC feed BoltD (3 tasks); BoltD feeds BoltE (2 tasks) and BoltF (1 task).)
How Storm works: topology example

class DemoTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // Output streams are declared in each component's declareOutputFields():
        // the spout declares a default stream ("uid", "item") and a stream
        // "item_copy" with the same fields; BoltB and BoltC declare default
        // streams ("uid", "fromB") and ("uid", "fromC"); BoltD declares streams
        // "forE" ("uid", "text") and "forF" ("uid", "text", "ne").
        builder.setSpout("Spout", new DemoSpout(), 2).setNumTasks(2);
        builder.setBolt("BoltA", new BoltA(), 2).setNumTasks(3)
               .shuffleGrouping("Spout", "item_copy");
        builder.setBolt("BoltB", new BoltB(), 2).setNumTasks(2)
               .shuffleGrouping("Spout");
        builder.setBolt("BoltC", new BoltC(), 2).setNumTasks(6)
               .shuffleGrouping("BoltA");
        builder.setBolt("BoltD", new BoltD(), 3).setNumTasks(3)
               .fieldsGrouping("BoltC", new Fields("uid"))
               .fieldsGrouping("BoltB", new Fields("uid"));
        builder.setBolt("BoltE", new BoltE(), 1).setNumTasks(2)
               .shuffleGrouping("BoltD", "forE");
        builder.setBolt("BoltF", new BoltF(), 1).setNumTasks(1)
               .shuffleGrouping("BoltD", "forF");

        Config conf = new Config();
        StormSubmitter.submitTopology("demoTopology", conf, builder.createTopology());
    }
}
How Storm works: spout example

public class DemoSpout extends BaseRichSpout {
    private SpoutOutputCollector _collector;
    private Queue<String> _queue;

    @Override
    public void open(Map conf, TopologyContext context,
                     SpoutOutputCollector collector) {
        _collector = collector;
        _queue = new ConcurrentLinkedQueue<String>(); // or your favourite queue
    }

    @Override
    public void nextTuple() {
        String nextItem = _queue.poll();
        if (nextItem != null) {
            _collector.emit(new Values(nextItem));
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("item"));
    }
}
How Storm works: bolt example

public class BoltA extends BaseRichBolt {
    private OutputCollector _collector;

    @Override
    public void prepare(Map conf, TopologyContext context,
                        OutputCollector collector) {
        _collector = collector;
    }

    @Override
    public void execute(Tuple tuple) {
        String item = tuple.getString(0);
        String capitalizedItem = capitalize(item);
        // anchor the new tuple on the input, then ack the input
        _collector.emit(tuple, new Values(capitalizedItem));
        _collector.ack(tuple);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("item"));
    }
}
Storm UI
Read more about Storm
• Storm: http://storm-project.net/
• Example Storm topologies: https://github.com/nathanmarz/storm-starter
• Implementing Real-Time Trending Topics With a Distributed Rolling Count Algorithm: http://www.michael-noll.com/blog/2013/01/18/implementing-real-time-trending-topics-in-storm/
• Understanding the Internal Message Buffers of Storm: http://www.michael-noll.com/blog/2013/06/21/understanding-storm-internal-message-buffers/
• Understanding the Parallelism of a Storm Topology: http://www.michael-noll.com/blog/2012/10/16/understanding-the-parallelism-of-a-storm-topology/
Storm in our company
Ferret Go GmbH: Trend & Media Analytics
ferret-go.com
Our data flow (simplified)
(Diagram: sources such as Google+, blogs, comments, online media, offline media, and reviews flow through processing, classification, and analyzing stages into ElasticSearch.)
Problem overview
• we have a number of streams (e.g. Google+, Twitter, Facebook) that spout items
• for every item we perform different calculations
• at the end of the calculations we save the item into storage(s): ElasticSearch, PostgreSQL, etc.
• if processing fails because of environment issues, we want to re-queue the item easily
• some of our calculations can be done in parallel
Solution
• Redis-based queues for spouting
• 1-2 spouts per topology
• 1 bulk bolt per worker for storage writing
• Storm cluster with 2 nodes: 32 GB RAM, 4-core i7 CPU, Java 7, Ubuntu 12.04
• ~20 items per second (could be increased)
• 3 worker slots per node, 198 tasks, 68 executors
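The "bulk bolt" above buffers incoming items and writes them to storage in batches rather than one at a time. A stdlib-only sketch of that buffering logic (names are illustrative; the Consumer stands in for an ElasticSearch or PostgreSQL bulk write):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Buffers incoming items and flushes them in bulk once the batch is
// full: the pattern behind "1 bulk bolt for storage writing per worker".
public class BulkWriterSketch {
    private final List<String> buffer = new ArrayList<>();
    private final int batchSize;
    private final Consumer<List<String>> storage; // stand-in for ES/PostgreSQL

    public BulkWriterSketch(int batchSize, Consumer<List<String>> storage) {
        this.batchSize = batchSize;
        this.storage = storage;
    }

    public void write(String item) {
        buffer.add(item);
        if (buffer.size() >= batchSize) flush();
    }

    // In a real bolt you would also flush on tick tuples or shutdown,
    // so a partially filled batch is never stranded.
    public void flush() {
        if (!buffer.isEmpty()) {
            storage.accept(new ArrayList<>(buffer));
            buffer.clear();
        }
    }

    public static int demo() {
        List<List<String>> batches = new ArrayList<>();
        BulkWriterSketch writer = new BulkWriterSketch(2, batches::add);
        writer.write("a"); writer.write("b"); writer.write("c");
        writer.flush();
        return batches.size(); // 2 batches: [a, b] and [c]
    }

    public static void main(String[] args) {
        System.out.println(demo()); // 2
    }
}
```

Batching amortizes the per-request overhead of the storage backend, which is what makes the ~20 items/sec figure easy headroom to increase.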
Thank you!
30.09.2013
September BWB Meetup
Andrii Gakhov