
Real-time Analytics with Spark

Maciej Dabrowski, Chief Data Scientist, Altocloud
Galway Data Meetup, 2015-02-03

2

MEETS A SMALL STARTUP

source: https://media.licdn.com/mpr/mpr/p/1/005/0a0/167/2f98d60.jpg

‣ We build predictive communications software that uses analytics to improve customer interactions and experience

Altocloud

3

Monitoring live users

4

5

6

ANALYTICS

source: http://olap.com/

‣ "Real-time" for us means a latency of 1-5 seconds

‣ Q: How many customers are currently online?

‣ Q: How many chats/calls are taking place at the moment?

‣ Q: What is the utilisation of my customer support agents?

Use Case 1: Real-time analytics

7

‣ Q: How many calls were offered in the last week?

‣ Q: What is the acceptance rate of my chat offers?

Use Case 2: Reporting

8

‣ Q: Which customers currently on my site should I engage?

Use Case 3: Predictive Analytics

9

‣ Scalability

‣ Limited resources

‣ Various analytics use cases

Technical challenges

10

11

Real-time analytics with Hadoop

source: http://barbarashdwallpapers.com/funny-elephant-wallpapers/

[diagram: APIs on top of a QUERYING LAYER, STORAGE LAYER and PROCESSING LAYER]

Altocloud Platform

12

[diagram: APPS and DATA SOURCES feed FRONT-END and BACK-END APIs; MESSAGE QUEUES (KAFKA, RABBIT MQ) carry events into the PROCESSING LAYER (SPARK, SPARK STREAMING); the STORAGE LAYER holds CASSANDRA, HDFS and MONGODB, with the QUERYING LAYER on top]

Altocloud Data Platform

13

[diagram: the same data platform, now with the MONGODB OPLOG as an additional source feeding KAFKA alongside the FRONT-END APIs]

‣ One code base for streaming and batch processing

‣ Rich API in Scala/Python/Java

‣ Fast for iterative algorithms (important for ML)

‣ Growing community

‣ The concept of a micro-batch

‣ Nicely integrates with Kafka and Cassandra

‣ Fairly easy setup

Why Spark

14
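The micro-batch concept mentioned above is easy to picture without Spark itself: the stream is sliced into fixed time intervals, and each slice is processed as one small batch. A minimal pure-Python sketch of that slicing (an illustration of the idea, not Spark's actual scheduler):

```python
from itertools import groupby

def micro_batches(timestamped_events, batch_interval):
    """Slice (timestamp, event) pairs into consecutive batches of
    `batch_interval` seconds, the way Spark Streaming turns a stream
    into a sequence of small RDDs."""
    def slot(pair):
        return int(pair[0] // batch_interval)
    ordered = sorted(timestamped_events, key=slot)
    return [[event for _, event in group]
            for _, group in groupby(ordered, key=slot)]

events = [(0.5, "a"), (1.2, "b"), (2.7, "c"), (3.1, "d")]
# micro_batches(events, 2) == [["a", "b"], ["c", "d"]]
```

Each resulting batch can then be handed to ordinary batch code, which is why one code base can serve both streaming and batch processing.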

Spark components

15

‣ Hadoop


‣ Spark

Word count in Spark

16
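The word-count code screenshots did not survive the transcript. As a stand-in, here is the classic word-count shape in plain Python, with comments naming the Spark RDD operations each step corresponds to (a sketch of the shape, not the Spark API itself):

```python
from collections import Counter

def word_count(lines):
    # Spark:  lines.flatMap(lambda line: line.split())
    words = (w for line in lines for w in line.split())
    # Spark:  .map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
    return Counter(words)

counts = word_count(["to be or not", "to be"])
# counts == Counter({'to': 2, 'be': 2, 'or': 1, 'not': 1})
```

The Spark version is a few chained calls; the equivalent Hadoop MapReduce job needs separate mapper, reducer, and driver classes, which is the contrast the slide draws.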

‣ Example: user event aggregation stored in Cassandra

‣ Still much better than Hadoop!

What about something more useful?

17

‣ User activity is an input (e.g. page view)

‣ Users from multiple businesses are online at once

‣ Scale: 100s to 100,000s of activities per second

‣ Response time under 5s

‣ A perfect use case for Spark Streaming

Counting users currently online

18

‣ Pub-sub message broker

‣ Fast: 100s of MB/s on a single broker

‣ Scalable: partitioned data streams

‣ Durable: messages persisted and replicated

‣ Distributed: strong durability and fault tolerance

‣ Downside: requires ZooKeeper

see https://kafka.apache.org

Data source: Kafka

19


‣ Kafka with Spark: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/

Spark and Kafka

20

‣ A simple count of unique events


‣ Count visit events for unique users

Count users online

21
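The code screenshots are missing from this slide too, but the exact version of the count is simple to sketch in plain Python. The set of user ids is what grows without bound, which is the memory problem the next slide addresses (names are illustrative):

```python
def online_users(visit_events):
    """Exact distinct count of users in one micro-batch / window.
    The set keeps every user id it has seen, so memory grows linearly
    with the number of distinct users -- fine for thousands, painful
    for millions."""
    return len({event["user_id"] for event in visit_events})

batch = [{"user_id": "u1"}, {"user_id": "u2"}, {"user_id": "u1"}]
# online_users(batch) == 2: three visit events, two unique users
```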

‣ Twitter Algebird to the rescue!

‣ HyperLogLog: a probabilistic data structure that saves a lot of memory

‣ https://github.com/twitter/algebird

Sets can take a lot of memory!

22
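Algebird's HyperLogLog is a Scala library; to show why the structure saves memory, here is a deliberately simplified pure-Python HyperLogLog following the standard algorithm (route each hash to one of 2^b registers by its first b bits, keep the maximum leading-zero rank per register, estimate from the harmonic mean). A sketch for intuition, not production code:

```python
import hashlib
import math

def _hash64(value):
    # Stable 64-bit hash (Python's built-in hash() is salted per process)
    return int.from_bytes(hashlib.sha1(str(value).encode()).digest()[:8], "big")

class HyperLogLog:
    """Simplified HyperLogLog: 2^b registers, each storing the maximum
    leading-zero rank seen among the hashes routed to it."""

    def __init__(self, b=12):
        self.b = b
        self.m = 1 << b                       # number of registers
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)  # bias constant, m >= 128

    def add(self, value):
        x = _hash64(value)
        idx = x >> (64 - self.b)              # first b bits pick the register
        rest = x & ((1 << (64 - self.b)) - 1)
        rank = (64 - self.b) - rest.bit_length() + 1  # leading zeros + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def count(self):
        est = self.alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        if est <= 2.5 * self.m:               # small-range correction
            zeros = self.registers.count(0)
            if zeros:
                est = self.m * math.log(self.m / zeros)
        return int(est)

hll = HyperLogLog()
for i in range(10000):
    hll.add(f"user-{i}")
# hll.count() lands within a few percent of 10000 while using only
# 4096 small registers instead of a set holding 10000 ids
```

Because register-wise max is associative and commutative, per-batch sketches can be merged across Spark partitions and windows, which is exactly the monoid property Algebird packages up.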

‣ Easy to set up

‣ High availability - no master

‣ Great performance

‣ CQL - SQL like querying

‣ Great support and bug-free drivers from DataStax

‣ Key: design your schema around your queries

see https://cassandra.apache.org

Storing your results

23

‣ The DataStax driver is very easy to use


‣ Save our results to Cassandra

Store data in Cassandra

24
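"Design your schema around queries" can be made concrete with a hypothetical table for the users-online count: the query ("unique users for org X over time") dictates the primary key. All names here are illustrative, not Altocloud's actual schema:

```python
# Hypothetical CQL schema: the partition key is org_id (one query hits one
# partition) and window_start clusters rows in the order we read them.
CREATE_TABLE = """
CREATE TABLE IF NOT EXISTS analytics.users_online (
    org_id text,
    window_start timestamp,
    unique_users int,
    PRIMARY KEY (org_id, window_start)
) WITH CLUSTERING ORDER BY (window_start DESC)
"""

def upsert_statement(org_id, window_start, unique_users):
    """Build the parameterised CQL an app would hand to the DataStax driver.
    In Cassandra every INSERT is an upsert, so re-emitting a window is safe."""
    cql = ("INSERT INTO analytics.users_online "
           "(org_id, window_start, unique_users) VALUES (%s, %s, %s)")
    return cql, (org_id, window_start, unique_users)
```

A streaming job would execute one such upsert per org per batch, which keeps writes idempotent if a batch is ever replayed.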

25

source: http://top1walls.com

‣ A Spark Streaming job performs two major tasks:

• data processing
• data receiving

‣ Receiver always takes one core

‣ Technically, you need 2N cores to run N streaming jobs

‣ Not a big deal in production, but what about testing?

Spark Streaming

26
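The core arithmetic behind the slide's claim, assuming the receiver-based Kafka input of the era (one receiver per job):

```python
def min_cores(n_jobs, receivers_per_job=1):
    """Minimum cores for receiver-based Spark Streaming jobs: each receiver
    pins one core permanently, and each job needs at least one more core
    free to actually process its batches -- hence "2N cores for N jobs"."""
    return n_jobs * (receivers_per_job + 1)

# min_cores(1) == 2: this is why a local test run with master "local[1]"
# receives data but never processes it -- the only core is eaten by the receiver.
```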

‣ Containerise your app including all its dependencies

‣ Distribute your app in this standard container

‣ Run it on any machine with Docker

‣ Very lightweight

Docker

27

‣ AWS example

Spark

[diagram: a SPARK DRIVER on a c3.large (2 cores) and SPARK EXECUTORs on c3.xlarge instances (4 cores: CORE 1 - CORE 4)]

‣ AWS example

Spark on Docker

[diagram: the same c3.large SPARK DRIVER, but each SPARK EXECUTOR now runs in a Docker container (docker-1, docker-2), each exposing 4 "cores" (C1-C4) on the c3.xlarge host]

‣ Spark Streaming is fast to deploy but tuning is VERY important

‣ The lower the number of tasks, the better (in general)

‣ When reading from Kafka, make sure you configure spark.streaming.blockInterval

‣ Optimise your jobs when possible: similar jobs can sometimes be merged

‣ Persist your data from the workers, NOT the driver

Spark Streaming

30

‣ OLAP-type queries using Spark SQL

‣ More advanced performance testing

‣ Detailed unit testing

‣ More batch jobs

Where do we go from here?

31