All Things Open 2015 - Spark & Storm: When & Where?

Date post: 16-Apr-2017
Upload: mammoth-data
Spark & Storm: When & Where?

www.mammothdata.com | @mammothdataco

The Leader in Big Data Consulting

● BI/Data Strategy
    ○ Development of a business intelligence / data architecture strategy.

● Installation
    ○ Installation of Hadoop or relevant technology.

● Data Consolidation
    ○ Load data from diverse sources into a single scalable repository.

● Streaming
    ○ Mammoth will write ingestion and/or analytics code that operates on the data as it comes in, and design dashboards, feeds, or computer-driven decision-making processes to derive insights and make decisions.

● Visualization Tools
    ○ Mammoth will set up a visualization tool (e.g. Tableau, Pentaho) and will also create initial reports and provide training to the employees who will analyze the data.

Mammoth Data, based in downtown Durham (right above Toast)

● Lead Consultant on all things DevOps and Spark

● @carsondial on Twitter

Me!

● Quick overview of Spark Streaming

● Reasons why Spark Streaming can be tricky in practice

● Performance and tuning tips we’ve learnt over the past two years

● …and when to pack it all in and use Storm instead

What This Talk Is About

This IS WEB SCALE!

● I kid, Rails!

● (mostly)

Beyond Web Scale

● Spark & Storm - millions of requests / second on commodity hardware

● Different problems at different scales!

Beyond Web Scale

● Directed Acyclic Graph Data Processing Engine

● Based around the Resilient Distributed Dataset (RDD) primitive

Spark

Spark Streaming — Overview

Spark Streaming — In Production?

● Yes!

● (Alibaba, AutoTrader, Cisco, Netflix, etc.)

● Streaming by running batches very quickly!

● Batch length: can be as low as 0.5s / batch

● Every X seconds, get Y records (DStream/RDDs)

Spark Streaming — Overview
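
A minimal sketch of the micro-batch idea in plain Scala, with no Spark dependency: timestamped records are grouped into fixed-length batches, which is roughly how a DStream hands you one RDD per interval. The `MicroBatchSketch` object, the `Record` type and the 500 ms interval are illustrative assumptions, not Spark APIs.

```scala
// Toy model of micro-batching; this is not Spark code.
object MicroBatchSketch {
  // Hypothetical record: arrival time in milliseconds plus a payload.
  final case class Record(timestampMs: Long, payload: String)

  // Group records into consecutive batches of `batchMs` milliseconds.
  // Batches are keyed by index (timestamp / batchMs) and returned in order;
  // intervals with no records are simply absent.
  def toBatches(records: Seq[Record], batchMs: Long): Seq[(Long, Seq[Record])] =
    records
      .groupBy(r => r.timestampMs / batchMs)
      .toSeq
      .sortBy(_._1)

  def main(args: Array[String]): Unit = {
    val records = Seq(
      Record(100, "a"), Record(450, "b"), Record(600, "c"), Record(1200, "d")
    )
    toBatches(records, batchMs = 500).foreach { case (index, batch) =>
      println(s"batch $index: ${batch.map(_.payload).mkString(", ")}")
    }
  }
}
```

Every batch then goes through the same processing code, which is why batch length and per-batch processing time have to be tuned together.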

● Using same implementation (mostly) for batch and stream processing (Lambda Architecture hipster points ahoy!)

● Access to the rest of Spark - DataFrames, MLlib, GraphX, etc.

Spark Streaming — Good Things

● What happens if you can’t process Y records in X seconds?

● What happens if you require sub-second latency?

Spark Streaming — Bad Things!

Spark Streaming — I’m so sorry.

● What happens if you can’t process Y records in X seconds?

● Data builds up in executors

● Executors run out of memory…

Spark Streaming — Bad Things!
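
To make the failure mode concrete, here is a toy calculation in plain Scala (not Spark code) of how unprocessed records pile up when more arrive per batch interval than can be processed. The `BacklogSketch` object and the numbers used are assumptions for illustration only.

```scala
// Toy arithmetic for backlog growth; not a model of Spark internals.
object BacklogSketch {
  // Backlog after each interval when `arriving` records come in per interval
  // and at most `capacity` records can be processed per interval.
  def backlogOverTime(arriving: Int, capacity: Int, intervals: Int): Seq[Long] =
    (1 to intervals)
      .scanLeft(0L) { (backlog, _) => math.max(0L, backlog + arriving - capacity) }
      .tail // drop the initial zero so element i is the backlog after interval i + 1

  def main(args: Array[String]): Unit = {
    // 1200 records arrive per interval but only 1000 can be processed.
    println(backlogOverTime(arriving = 1200, capacity = 1000, intervals = 5).mkString(", "))
  }
}
```

The backlog grows linearly for as long as the overload lasts, and memory use grows with it.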

● “Hey, we forgot to tell you Ops people that we have a major new client adding stuff into the firehose sometime today. That’s fine, right?”

Spark Streaming — Bad Things!

Spark Streaming — It Will Be Okay

● As a former Ops person:

● WE WILL REMEMBER.

Spark Streaming — Bad Things!

● Do you need low latency?

● If so, a 10-minute nap is advisable!

● Everybody else, let’s dive in…

Spark Streaming — Tuning

Spark Streaming — Tuning

Spark Streaming — Down In The Hole

Spark Streaming — Down In The Hole

● Easiest method — alter the batch window until it’s all fine!

● Tiny batches provide tight execution times!

Spark Streaming — Down In The Hole

● Use Kafka.

● Data source with the most love (e.g. exactly-once semantics without Write Ahead Logs and receiver-less operation in 1.3+)

● (other sources get the features…eventually)

Spark Streaming — Tuning

● Use Scala.

● CPython = slower in execution

● PyPy is much faster…but…

● New features always come to Scala first.

Spark Streaming — Tuning

● (or Java if you really must)

Spark Streaming — Tuning

● Spark Streaming = data receivers + Spark

● spark.cores.max = x * number of receivers

● For Great Data Locality and Parallelism!

Spark Streaming — Cores
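
A small sketch of the sizing rule in plain Scala. Each receiver occupies a core for as long as the job runs, so the multiplier has to be greater than 1 or nothing is left for processing. The `CoresSketch` object and the `multiplier` parameter (the slide's `x`) are illustrative names; only the `spark.cores.max` key comes from Spark itself.

```scala
// Helper for deriving spark.cores.max from the receiver count; names are illustrative.
object CoresSketch {
  // Total cores to request: receivers * multiplier.
  def coresMax(receivers: Int, multiplier: Int): Int = {
    require(receivers > 0, "need at least one receiver")
    require(multiplier > 1, "a multiplier of 1 leaves no cores for processing")
    receivers * multiplier
  }

  // The setting as a key/value pair, e.g. to pass along with --conf on submission.
  def settings(receivers: Int, multiplier: Int): Map[String, String] =
    Map("spark.cores.max" -> coresMax(receivers, multiplier).toString)

  def main(args: Array[String]): Unit =
    settings(receivers = 3, multiplier = 4).foreach { case (k, v) => println(s"$k=$v") }
}
```

The right multiplier depends on the workload, so treat the value as a starting point to measure against rather than a fixed rule.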

● Are you using a foreachRDD loop?

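dstream.foreachRDD { rdd =>
  rdd.cache()      // keep this batch in memory across multiple passes
  …
  rdd.unpersist()  // release it once the batch has been handled
}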

Spark Streaming — Caching

● If you are routing to multiple stores or iterating over an RDD multiple times, using cache() is a quick win

● It really shouldn’t work so well…

Spark Streaming — Caching

● Hurrah for Spark 1.5!

● spark.streaming.backpressure.enabled = true

● Spark dynamically alters incoming data rates (keeping the data in Kafka rather than in the executors)

● Works for all data sources (for once!)

Spark Streaming — Backpressure
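
The effect of backpressure can be sketched with a toy model in plain Scala. It is deliberately simpler than Spark's real rate estimation and only shows where the excess data ends up: with backpressure, ingestion is capped at what can be processed and the surplus waits upstream (e.g. in Kafka); without it, everything is pulled into the executors. The `BackpressureSketch` object, the `State` type and the numbers are all assumptions for illustration.

```scala
// Toy model of rate-limited ingestion; not Spark's actual rate controller.
object BackpressureSketch {
  // Records waiting upstream (e.g. in Kafka) and records held in executors.
  final case class State(upstreamLag: Long, inExecutors: Long)

  // Advance the model by one batch interval.
  def step(state: State, arriving: Int, capacity: Int, backpressure: Boolean): State = {
    val available = state.upstreamLag + arriving
    // With backpressure, pull only what can be processed this interval;
    // without it, pull everything that is available.
    val pulled = if (backpressure) math.min(available, capacity.toLong) else available
    val processed = math.min(state.inExecutors + pulled, capacity.toLong)
    State(available - pulled, state.inExecutors + pulled - processed)
  }

  // Run the model for a number of intervals, starting from an empty system.
  def run(arriving: Int, capacity: Int, intervals: Int, backpressure: Boolean): State =
    (1 to intervals).foldLeft(State(0L, 0L)) { (s, _) =>
      step(s, arriving, capacity, backpressure)
    }

  def main(args: Array[String]): Unit = {
    println(run(arriving = 1200, capacity = 1000, intervals = 5, backpressure = true))
    println(run(arriving = 1200, capacity = 1000, intervals = 5, backpressure = false))
  }
}
```

In both runs the same number of records is waiting after five intervals; the difference is whether they sit in durable upstream storage or in executor memory.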

● I really need that low-latency response!

Storm

● Directed Acyclic Graph Data Processing Engine

Storm

Spark

“Very Good, Sir”

Storm

“Here you go!”

● Stream of tuples

● Bolts

● Spouts

● Topologies

Storm Concepts

● Unbounded stream of tuples

● Tuples are defined via schema (usual base types plus custom serializers)

Storm — Streams

● Sources of tuples in a topology

● Read from external sources (e.g. Kafka) and emit the data as tuples

● Can emit multiple streams from a spout!

Storm — Spouts

● Where your processing happens

● Roll your own aggregations / filtering / windowing

● Bolts can feed into other bolts

● Potentially easier to test than Spark Streaming

● Many Bolt connectors for external sources (e.g. Cassandra, Redis, Hive, etc.)

Storm — Bolts

● The DAG of the spouts and bolts

● Built programmatically in code and submitted to the Storm cluster

● Flux - Do It In YAML (and then complain about whitespace)

Storm — Topologies
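
The spout, bolt and topology concepts can be modelled in a few lines of plain Scala. This is a toy and not the Storm API: `Tuple`, `Spout`, `Bolt` and `TopologySketch` are hypothetical stand-ins, and the "topology" here is a simple linear chain rather than a general DAG running on a cluster.

```scala
// Toy model of Storm's concepts; none of these names are Storm classes.
object TopologySketch {
  type Tuple = Map[String, String] // named fields, as a stand-in for a Storm tuple
  type Spout = Iterator[Tuple]     // a source of tuples
  type Bolt = Tuple => Seq[Tuple]  // processes one tuple, emits zero or more

  // A linear "topology": every tuple from the spout flows through each bolt in order.
  def run(spout: Spout, bolts: Seq[Bolt]): Seq[Tuple] =
    spout.flatMap { tuple =>
      bolts.foldLeft(Seq(tuple)) { (tuples, bolt) => tuples.flatMap(bolt) }
    }.toSeq

  // Example bolts: split a sentence into words, then filter out short words.
  val splitWords: Bolt = t => t("sentence").split(" ").toSeq.map(w => Map("word" -> w))
  val dropShort: Bolt = t => if (t("word").length > 3) Seq(t) else Seq.empty

  def main(args: Array[String]): Unit = {
    val spout: Spout = Iterator(Map("sentence" -> "storm streams tuples to bolts"))
    run(spout, Seq(splitWords, dropShort)).foreach(t => println(t("word")))
  }
}
```

Each tuple is pushed through the chain as soon as it is produced, with no batching step in between, which is the property the low-latency argument for Storm rests on.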

● Each bolt or spout runs 'tasks' across the cluster

● How parallelism works in Storm

● Set in topology submission

Storm — Tasks

● Where the topology runs

● 1 worker = 1 JVM

● Tasks run as threads on a worker

● Storm distributes tasks evenly across cluster

Storm — Workers
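
As a rough illustration of tasks being spread evenly over workers, the following plain Scala sketch assigns task ids to workers round-robin. It is an assumption-laden toy, not Storm's scheduler: `TaskPlacementSketch` and `assign` are hypothetical names, and a real cluster also takes available worker slots into account.

```scala
// Toy round-robin placement; not Storm's scheduling algorithm.
object TaskPlacementSketch {
  // Assign task ids 0 until `tasks` to `workers` worker JVMs, round-robin.
  // Returns worker index -> task ids; workers with no tasks are absent.
  def assign(tasks: Int, workers: Int): Map[Int, Seq[Int]] = {
    require(workers > 0, "need at least one worker")
    (0 until tasks).groupBy(_ % workers)
  }

  def main(args: Array[String]): Unit =
    assign(tasks = 8, workers = 3).toSeq.sortBy(_._1).foreach { case (worker, ids) =>
      println(s"worker $worker runs tasks ${ids.mkString(", ")}")
    }
}
```

With 8 tasks on 3 workers, no worker ends up with more than one task above any other.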

● True Streaming

● Tuples processed as they enter topology - low latency

● Scales far beyond Spark Streaming (currently)

Storm — Good Things

● Battle-tested at Twitter & Yahoo!

● Yahoo! has 300-node clusters and is working to support 1000+ nodes

● Single node clocked at over 1.5m tuples / second at Twitter

Storm — Good Things

● Very DIY (bring your own aggregations, ML, etc)

● Your DAG construction may not be optimal

● Operationally more complex (and Storm WebUI is more primitive)

● Where’s Me REPL?

Storm — Bad Things

Spark or Storm?

● SLA on latency?

Spark or Storm?

● Storm!

● (though simply because it’s possible doesn’t mean you’ll get it!)

Spark or Storm?

● Insane data needs (e.g. ~100m records/second?)

Spark or Storm?

● Storm!

● (though, again, it’s not a magic bullet!)

Spark or Storm?

● For almost anything else? Spark.

● High-level vs. Low-level

● Each new version of Spark delivers improvements!

Spark or Storm?

● Other frameworks that show promise:
    ○ Flink
    ○ Apex
    ○ Samza
    ○ Heron (Twitter’s not-public Storm replacement)

Other Listing Magazines Are Available

