8/19/2019 Storm Berkeley
Nathan Marz
Storm
Distributed and fault-tolerant realtime computation
Basic info
• Open sourced September 19th, 2011
• Implementation is 15,000 lines of code
• Used by over 25 companies
• >2400 watchers on GitHub (most watched JVM project)
• Very active mailing list
• >1800 messages
• >560 members
Before Storm
Queues and workers
Example
(simplified)
Example
Workers schemify tweets
and append to Hadoop
Example
Workers update statistics on URLs by
incrementing counters in Cassandra
Scaling
Deploy
Reconfigure/redeploy
Problems
• Scaling is painful
• Poor fault-tolerance
• Coding is tedious
What we want
• Guaranteed data processing
• Horizontal scalability
• Fault-tolerance
• No intermediate message brokers!
• Higher level abstraction than message passing
• “Just works”
Storm
Guaranteed data processing
Horizontal scalability
Fault-tolerance
No intermediate message brokers!
Higher level abstraction than message passing
“Just works”
Stream
processing
Continuous
computation
Distributed
RPC
Use cases
Storm Cluster
Storm Cluster
Master node (similar to Hadoop JobTracker)
Storm Cluster
Used for cluster coordination
Storm Cluster
Run worker processes
Starting a topology
Killing a topology
Concepts
• Streams
• Spouts
• Bolts
• Topologies
Streams
Unbounded sequence of tuples
Tuple Tuple Tuple Tuple Tuple Tuple Tuple
Spouts
Source of streams
Spout examples
• Read from Kestrel queue
• Read from Twitter streaming API
Bolts
Process input streams and produce new streams
Bolts
• Functions
• Filters
• Aggregation
• Joins
• Talk to databases
Topology
Network of spouts and bolts
Tasks
Spouts and bolts execute as
many tasks across the cluster
Task execution
Tasks are spread across the cluster
Stream grouping
When a tuple is emitted, which task does it go to?
Stream grouping
• Shuffle grouping: pick a random task
• Fields grouping: mod hashing on a
subset of tuple fields
• All grouping: send to all tasks
• Global grouping: pick task with lowest id
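The mod-hashing idea behind fields grouping can be sketched in plain Python. `choose_task` is a hypothetical helper, not Storm's actual routing code (which runs on the JVM), but it shows why equal field values always reach the same task:

```python
def choose_task(tup, grouping_fields, num_tasks):
    """Fields grouping: hash the chosen subset of the tuple's fields,
    then mod by the task count, so tuples with equal values for those
    fields always land on the same task."""
    key = tuple(tup[f] for f in grouping_fields)
    return hash(key) % num_tasks

# Tuples sharing the same "word" value go to the same of 8 tasks,
# regardless of their other fields:
t1 = choose_task({"word": "storm", "count": 1}, ["word"], 8)
t2 = choose_task({"word": "storm", "count": 7}, ["word"], 8)
assert t1 == t2 and 0 <= t1 < 8
```

Shuffle grouping, by contrast, just picks a random task, and global grouping always picks the task with the lowest id.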
Topology
(diagram: streams connected with shuffle, [“url”], shuffle, shuffle, [“id1”, “id2”], and all groupings)
Streaming word count
TopologyBuilder is used to construct topologies in Java
Streaming word count
Define a spout in the topology with parallelism of 5 tasks
Streaming word count
Split sentences into words with parallelism of 8 tasks
Consumer decides what data it receives and how it gets grouped
Streaming word count
Create a word count stream
Streaming word count
splitsentence.py
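The multilang bolt's body boils down to splitting and counting. A minimal standalone sketch of that logic (plain functions, no storm multilang wrapper), using the sample sentences that ship with storm-starter:

```python
from collections import Counter

def split_sentence(sentence):
    # the "split" bolt: emit one word per input sentence tuple
    return sentence.split()

def word_count(sentences):
    # the "count" bolt: keep running totals, as if a fields grouping
    # on ["word"] routed every copy of a word to this one task
    counts = Counter()
    for s in sentences:
        counts.update(split_sentence(s))
    return counts

counts = word_count(["the cow jumped over the moon",
                     "the man went to the store"])
assert counts["the"] == 4 and counts["cow"] == 1
```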
Streaming word count
Submitting topology to a cluster
Streaming word count
Running topology in local mode
Demo
Distributed RPC
Data flow for Distributed RPC
DRPC Example
Computing “reach” of a URL on the fly
Reach
Reach is the number of unique people
exposed to a URL on Twitter
Computing reach
URL → Tweeters → Followers → Distinct followers → Count → Reach
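The pipeline above can be sketched in plain Python. The tables here are hypothetical in-memory stand-ins for the real distributed lookups (the actual topology queries follower data via bolts):

```python
# Hypothetical in-memory stand-ins for the distributed lookups.
TWEETERS = {"http://a.com": ["alice", "bob"]}
FOLLOWERS = {"alice": ["carol", "dave"], "bob": ["dave", "erin"]}

def reach(url):
    distinct = set()
    for tweeter in TWEETERS.get(url, []):            # who tweeted the URL
        distinct.update(FOLLOWERS.get(tweeter, []))  # expand to followers
    return len(distinct)                             # distinct + count

assert reach("http://a.com") == 3  # carol, dave, erin (dave counted once)
```

The set-union step is exactly why reach is expensive at scale: follower lists overlap, so the distinct-follower set must be computed across all tweeters before counting.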
Reach topology
Reach topology
Keep set of followers for
each request id in memory
Reach topology
Update followers set when
receive a new follower
Reach topology
Emit partial count after receiving all followers for a request id
Demo
Guaranteeing message
processing
“Tuple tree”
Guaranteeing message
processing
• A spout tuple is not fully processed until all
tuples in the tree have been completed
Guaranteeing message
processing
• If the tuple tree is not completed within a
specified timeout, the spout tuple is replayed
Guaranteeing message
processing
Reliability API
Guaranteeing message
processing
“Anchoring” creates a new edge in the tuple tree
Guaranteeing message
processing
Marks a single node in the tree as complete
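Storm's acker tasks track tuple-tree completion in constant space per spout tuple: each edge gets a random 64-bit id that is XORed into a running value once when the edge is anchored and once when its tuple is acked, so the value returns to zero exactly when the tree is complete. A sketch of the invariant:

```python
import random

ack_val = 0
edge_ids = [random.getrandbits(64) for _ in range(3)]
for e in edge_ids:
    ack_val ^= e   # edge created by anchoring
for e in edge_ids:
    ack_val ^= e   # tuple acked: same id XORed a second time
assert ack_val == 0  # every id appeared twice, so the tree is complete
```

A false zero from colliding random ids is possible in principle, but with 64-bit ids the probability is negligible.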
Transactional topologies
How do you do idempotent counting with an at-least-once delivery guarantee?
Won’t you overcount?
Transactional topologies
Transactional topologies solve this problem
Transactional topologies
Built completely on top of Storm’s primitives
of streams, spouts, and bolts
Transactional topologies
Transactional topologies
Process small batches of tuples
Transactional topologies
If a batch fails, replay the whole batch
Transactional topologies
Once a batch is completed, commit the batch
Transactional topologies
Bolts can optionally be “committers”
Transactional topologies
Commits are ordered. If there’s a failure during
commit, the whole batch + commit is retried
Commit 1 → Commit 2 → Commit 3 → Commit 4 → Commit 4 (retried)
Example
Example
New instance of this object for every transaction attempt
Example
Aggregate the count for
this batch
Example
Only update database if
transaction ids differ
Example
This enables idempotency since
commits are ordered
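The store-the-txid trick can be sketched with a plain dict standing in for the database; `apply_batch` is a hypothetical helper name, not the API from the slides:

```python
db = {}  # key -> (count, txid), as stored together in the real database

def apply_batch(key, batch_count, txid):
    """Apply a batch's partial count exactly once. Because commits are
    strongly ordered, seeing the stored txid again means this exact
    batch was already applied, so the replay is skipped."""
    count, last_txid = db.get(key, (0, None))
    if txid != last_txid:
        db[key] = (count + batch_count, txid)

apply_batch("urls", 10, txid=1)
apply_batch("urls", 10, txid=1)  # replayed batch: skipped, no overcount
apply_batch("urls", 5, txid=2)
assert db["urls"] == (15, 2)
```

Storing the count and the transaction id atomically in the same value is the whole trick: the database itself records how far the ordered commit stream has progressed.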
Example
(Credit goes to Kafka devs
for this trick)
Transactional topologies
Multiple batches can be processed in parallel,
but commits are guaranteed to be ordered
• Will be available in next version of Storm (0.7.0)
• Requires a source queue that can replay identical batches of messages
• storm-kafka has a transactional spout implementation for Kafka
Transactional topologies
Storm UI
Storm on EC2
https://github.com/nathanmarz/storm-deploy
One-click deploy tool
Starter code
https://github.com/nathanmarz/storm-starter
Example topologies
Documentation
Ecosystem
• Scala, JRuby, and Clojure DSLs
• Kestrel, AMQP, JMS, and other spout adapters
• Serializers
• Multilang adapters
• Cassandra, MongoDB integration
Questions?
http://github.com/nathanmarz/storm
Future work
• State spout
• Storm on Mesos
• “Swapping”
• Auto-scaling
• Higher level abstractions
Implementation
KafkaTransactionalSpout
Implementation
TransactionalSpout is a subtopology
consisting of a spout and a bolt
Implementation
The spout consists of one task that
coordinates the transactions
Implementation
The bolt emits the batches of tuples
Implementation
The coordinator emits a “batch” stream and a “commit” stream
Implementation
Batch stream
Implementation
Commit stream
Implementation
Coordinator reuses the tuple tree framework to detect success or failure of batch processing