
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson

Transcript
  • Streaming Analytics with Spark, Kafka, Cassandra, and Akka

    Helena Edelson VP of Product Engineering @Tuplejump

  • Committer / Contributor: Akka, FiloDB, Spark Cassandra Connector, Spring Integration

    VP of Product Engineering @Tuplejump

    Previously: Sr Cloud Engineer / Architect at VMware, CrowdStrike, DataStax and SpringSource

    [email protected] / github.com/helena

  • Tuplejump Open Source github.com/tuplejump

    FiloDB - distributed, versioned, columnar analytical db for modern streaming workloads

    Calliope - the first Spark-Cassandra integration
    Stargate - Lucene indexer for Cassandra
    SnackFS - HDFS-compatible file system for Cassandra

  • What Will We Talk About

    The Problem Domain
    Example Use Case
    Rethinking Architecture - we don't have to look far to look back
    Streaming
    Revisiting the goal and the stack
    Simplification

  • THE PROBLEM DOMAIN: Delivering Meaning From A Flood Of Data

  • The Problem Domain: Need to build scalable, fault-tolerant, distributed data processing systems that can handle massive amounts of data from disparate sources, with different data structures.

  • Translation: How to build adaptable, elegant systems for complex analytics and learning tasks that run as large-scale clustered dataflows?

  • How Much Data?

    We all have a lot of data: terabytes, petabytes...

    A yottabyte is a quadrillion gigabytes, or a septillion bytes.

    https://en.wikipedia.org/wiki/Yottabyte

  • Delivering Meaning

    Deliver meaning in second / sub-second latency
    Disparate data sources & schemas
    Billions of events per second
    High-latency batch processing
    Low-latency stream processing
    Aggregation of historical data from the stream

  • While We Monitor, Predict & Proactively Handle

    Massive event spikes & bursty traffic
    Fast producers / slow consumers
    Network partitioning & out-of-sync systems
    DC down
    Wait, we've DDOS'd ourselves from fast streams?
    Autoscale issues

    When we scale down VMs, how do we not lose data?

  • And stay within our AWS / Rackspace budget

  • EXAMPLE CASE: CYBER SECURITY

    Hunting The Hunter


  • Adversary Profiling & Hunting

    Track activities of international threat actor groups: nation-state, criminal, or hacktivist
    Intrusion attempts
    Actual breaches

    Profile adversary activity
    Analysis to understand their motives, anticipate actions, and prevent damage

  • Stream Processing

    Machine events
    Endpoint intrusion detection
    Anomalies / indicators of attack or compromise

    Machine learning:
    Training models based on patterns from historical data
    Predict potential threats; profiling for adversary identification

  • Data Requirements & Description

    Streaming event data:
    Log messages
    User activity records
    System ops & metrics data

    Disparate data sources
    Wildly differing data structures

  • Massive Amounts Of Data

    One machine can generate 2+ TB per day
    Tracking millions of devices
    1 million writes per second - bursty
    High % writes, lower % reads
    TTL (time-to-live) on data

  • RETHINKING ARCHITECTURE


  • WE DON'T HAVE TO LOOK FAR TO LOOK BACK


    Rethinking Architecture

  • Most batch analytics flows from several years ago looked like... (architecture diagram)

  • STREAMING & DATA SCIENCE


    Rethinking Architecture

  • Streaming: I need fast access to historical data on the fly for predictive modeling, with real-time data from the stream.

  • Not A Stream, A Flood

    Data emitters
    Netflix: 1 - 2 million events per second at peak, 750 billion events per day
    LinkedIn: > 500 billion events per day

    Data ingesters
    Netflix: 50 - 100 billion events per day
    LinkedIn: 2.5 trillion events per day

    1 petabyte of streaming data

  • Which Translates To: Do it fast. Do it cheap. Do it at scale.

  • Challenges

    Code changes at runtime
    Distributed data consistency
    Ordering guarantees
    Complex compute algorithms

  • Oh, and don't lose data


  • Strategies

    Partition For Scale & Data Locality
    Replicate For Resiliency
    Share Nothing
    Fault Tolerance
    Asynchrony & Async Message Passing
    Memory Management
    Data Lineage and Reprocessing at Runtime
    Parallelism
    Elastic Scaling
    Isolation
    Location Transparency

  • AND THEN WE GREEKED OUT


    Rethinking Architecture

  • Lambda Architecture: A data-processing architecture designed to handle massive quantities of data by taking advantage of both batch and stream processing methods.

    An approach coined by Nathan Marz. This was a huge stride forward.

  • Lambda Architecture diagram: https://www.mapr.com/developercentral/lambda-architecture

  • Implementing Is Hard

    Real-time pipeline backed by a KV store for updates
    Many moving parts - KV store, real-time, batch
    Running similar code in two places
    Still ingesting data to Parquet/HDFS
    Reconciling queries against two different places

  • Also hard: performance tuning & monitoring on so many systems

  • Lambda Architecture: An immutable sequence of records is captured and fed into a batch system and a stream processing system in parallel.

  • WAIT, DUAL SYSTEMS?

    Challenge Assumptions

  • Which Translates To

    Performing analytical computations & queries in dual systems
    Implementing transformation logic twice
    Duplicate code
    Spaghetti architecture for data flows
    One busy network

  • Why Dual Systems?

    Why is a separate batch system needed? Why support the code, machines, and running services of two analytics systems?

    Counterproductive on some level?

  • YES

    A unified system for streaming and batch
    Real-time processing and reprocessing
    Code changes
    Fault tolerance

    http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html - Jay Kreps

  • ANOTHER ASSUMPTION: ETL


    Challenge Assumptions

  • Extract, Transform, Load (ETL)

    "Designing and maintaining the ETL process is often considered one of the most difficult and resource-intensive portions of a data warehouse project."

    http://docs.oracle.com/cd/B19306_01/server.102/b14223/ettover.htm

  • Extract, Transform, Load (ETL)

    ETL involves:
    Extraction of data from one system into another
    Transforming it
    Loading it into another system

  • Extract, Transform, Load (ETL)

    Also unnecessarily redundant and often typeless

  • ETL

    Each ETL step can introduce errors and risk
    Can duplicate data after failover
    Tools can cost millions of dollars
    Decreases throughput
    Increases complexity

  • ETL

    Writing intermediary files
    Parsing and re-parsing plain text

  • And let's duplicate the pattern over all our data centers

  • 46

    These are not the solutions you're looking for

  • REVISITING THE GOAL & THE STACK


  • Removing The 'E' in ETL: Thanks to technologies like Avro and Protobuf we don't need the E in ETL. Instead of text dumps that you need to parse over multiple systems:

    Scala & Avro (e.g.) - see the sketch below

    Can work with binary data that remains strongly typed

    A return to strong typing in the big data ecosystem
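    To make that concrete, here is a minimal sketch (not from the talk) of round-tripping a strongly typed record through Avro's binary encoding from Scala, using Avro's GenericRecord API; the schema and field names are illustrative assumptions:

    import java.io.ByteArrayOutputStream
    import org.apache.avro.Schema
    import org.apache.avro.generic.{GenericData, GenericDatumReader, GenericDatumWriter, GenericRecord}
    import org.apache.avro.io.{DecoderFactory, EncoderFactory}

    // Illustrative schema for a single weather event (field names are assumptions).
    val schema = new Schema.Parser().parse(
      """{"type":"record","name":"WeatherEvent","fields":[
        |{"name":"wsid","type":"string"},
        |{"name":"temperature","type":"double"}]}""".stripMargin)

    // Serialize: the record is validated against the schema - typed binary,
    // not a free-form text dump.
    val record: GenericRecord = new GenericData.Record(schema)
    record.put("wsid", "010010:99999")
    record.put("temperature", 72.0)

    val out = new ByteArrayOutputStream()
    val encoder = EncoderFactory.get().binaryEncoder(out, null)
    new GenericDatumWriter[GenericRecord](schema).write(record, encoder)
    encoder.flush()
    val bytes = out.toByteArray

    // Deserialize in any downstream system that shares the schema: no re-parsing.
    val decoder = DecoderFactory.get().binaryDecoder(bytes, null)
    val back = new GenericDatumReader[GenericRecord](schema).read(null, decoder)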

  • Removing The 'L' in ETL: If data collection is backed by a distributed messaging system (e.g. Kafka), you can do real-time fanout of the ingested data to all consumers. No need to batch "load".

    From there each consumer can do its own transformations (see the sketch below).
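    As a rough illustration (not from the talk, and using the newer Kafka consumer API): each consumer group keeps its own cursor over the same topic, so any number of downstream systems can independently read and transform the ingested stream. Topic and group names are assumptions:

    import java.util.Properties
    import java.time.Duration
    import org.apache.kafka.clients.consumer.KafkaConsumer
    import scala.collection.JavaConverters._

    // Each downstream system uses its own group.id, so every group receives the
    // full stream independently - real-time fanout with no batch "load" step.
    def consumerFor(groupId: String): KafkaConsumer[String, String] = {
      val props = new Properties()
      props.put("bootstrap.servers", "localhost:9092")
      props.put("group.id", groupId) // distinct group => independent cursor
      props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
      props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
      val consumer = new KafkaConsumer[String, String](props)
      consumer.subscribe(List("raw.weather.data").asJava) // illustrative topic name
      consumer
    }

    // One consumer feeds analytics, another feeds archival: same data, no "L".
    val analytics = consumerFor("analytics")
    val archive = consumerFor("archival")
    analytics.poll(Duration.ofMillis(500)).asScala.foreach(r => println(s"transform: ${r.value}"))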

  • #NoMoreGreekLetterArchitectures


  • NoETL


  • My Nerdy Chart (Strategy: Technologies)

    Scalable Infrastructure / Elastic: Spark, Cassandra, Kafka
    Partition For Scale, Network Topology Aware: Cassandra, Spark, Kafka, Akka Cluster
    Replicate For Resiliency: Spark, Cassandra, Akka Cluster (all hash the node ring)
    Share Nothing, Masterless: Cassandra, Akka Cluster (both Dynamo-style)
    Fault Tolerance / No Single Point of Failure: Spark, Cassandra, Kafka
    Replay From Any Point Of Failure: Spark, Cassandra, Kafka, Akka + Akka Persistence
    Failure Detection: Cassandra, Spark, Akka, Kafka
    Consensus & Gossip: Cassandra, Akka Cluster
    Parallelism: Spark, Cassandra, Kafka, Akka
    Asynchronous Data Passing: Kafka, Akka, Spark
    Fast, Low Latency, Data Locality: Cassandra, Spark, Kafka
    Location Transparency: Akka, Spark, Cassandra, Kafka

  • SMACK: Scala/Spark, Mesos, Akka, Cassandra, Kafka

  • Spark Streaming


  • Spark Streaming: One runtime for streaming and batch processing

    Join streaming and static data sets
    No code duplication
    Easy, flexible data ingestion from disparate sources to disparate sinks
    Easy to reconcile queries against multiple sources
    Easy integration of KV durable storage

  • How do I merge historical data with data in the stream?


  • Join Streams With Static Data

    import org.apache.spark.streaming.kafka.KafkaUtils
    import com.datastax.spark.connector.streaming._

    val ssc = new StreamingContext(conf, Milliseconds(500))
    ssc.checkpoint("checkpoint")

    // Static data set, loaded once and joined against every batch in the stream
    val staticData: RDD[(Int, String)] =
      ssc.sparkContext.textFile("whyAreWeParsingFiles.txt").flatMap(func)

    val stream: DStream[(Int, String)] =
      KafkaUtils.createStream(ssc, zkQuorum, group, Map(topic -> n))

    stream.transform(events => events.join(staticData))
      .saveToCassandra(keyspace, table)

    ssc.start()

  • Spark MLlib (pipeline diagram: training data → feature extraction → model training → model testing with test data; your data → extract data to analyze → train your model to predict)

  • Spark Streaming & ML

    val ssc = new StreamingContext(conf, Milliseconds(500))
    val model = KMeans.train(dataset, ...) // learn offline

    val stream = KafkaUtils
      .createStream(ssc, zkQuorum, group, ...)
      .map(event => model.predict(event.feature)) // score each event in the stream

  • Apache Mesos: Open-source cluster manager developed at UC Berkeley.

    Abstracts CPU, memory, storage, and other compute resources away from machines (physical or virtual), enabling fault-tolerant and elastic distributed systems to easily be built and run effectively.

  • Akka: High-performance concurrency framework for Scala and Java

    Fault tolerance
    Asynchronous messaging and data processing
    Parallelization
    Location transparency
    Local / remote routing
    Akka: Cluster / Persistence / Streams

  • Akka Actors: A distribution and concurrency abstraction

    Compute isolation
    Behavioral context switching
    No exposed internal state
    Event-based messaging
    Easy parallelism
    Configurable fault tolerance

  • Akka Actor Hierarchy (diagram)

    http://www.slideshare.net/jboner/building-reactive-applications-with-akka-in-scala

  • import akka.actor._

    class NodeGuardianActor(args...) extends Actor with SupervisorStrategy {

      val temperature = context.actorOf(
        Props(new TemperatureActor(args)), "temperature")

      val precipitation = context.actorOf(
        Props(new PrecipitationActor(args)), "precipitation")

      override def preStart(): Unit = { /* lifecycle hook: init */ }

      def receive: Actor.Receive = {
        case Initialized => context become initialized
      }

      def initialized: Actor.Receive = {
        case e: SomeEvent  => someFunc(e)
        case e: OtherEvent => otherFunc(e)
      }
    }

  • Apache Cassandra

    Extremely fast, extremely scalable
    Multi-region / multi-datacenter
    Always on: no single point of failure, survives regional outages
    Easy to operate
    Automatic & configurable replication (see the sketch below)
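    To illustrate that last point (a sketch, not from the talk): multi-datacenter replication is just keyspace configuration, issued here from Scala with the DataStax Java driver; keyspace and datacenter names are assumptions:

    import com.datastax.driver.core.Cluster

    val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
    val session = cluster.connect()

    // NetworkTopologyStrategy takes per-datacenter replica counts; Cassandra then
    // replicates every write to both DCs automatically, surviving a regional outage.
    session.execute(
      """CREATE KEYSPACE IF NOT EXISTS isd_weather_data
        |WITH replication = {
        |  'class': 'NetworkTopologyStrategy',
        |  'us_east': 3,
        |  'eu_west': 3
        |}""".stripMargin)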

  • Apache Cassandra

    Very flexible data modeling (collections, user-defined types), and changeable over time
    Perfect for ingestion of real-time / machine data
    Huge community

  • Spark Cassandra Connector

    NoSQL joins!
    Write & read data between Spark and Cassandra
    Compatible with Spark 1.4
    Handles data locality for speed
    Implicit type conversions
    Server-side filtering - SELECT, WHERE, etc.
    Natural time series integration (see the sketch below)

    http://github.com/datastax/spark-cassandra-connector
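    A minimal sketch of the connector API (keyspace, table, and column names are assumptions, not from the talk):

    import org.apache.spark.{SparkConf, SparkContext}
    import com.datastax.spark.connector._ // adds cassandraTable / saveToCassandra

    val conf = new SparkConf()
      .setAppName("connector-sketch")
      .set("spark.cassandra.connection.host", "127.0.0.1")
    val sc = new SparkContext(conf)

    // Read: select & where are pushed down to Cassandra (server-side filtering).
    val precip = sc.cassandraTable("isd_weather_data", "daily_aggregate_precip")
      .select("wsid", "year", "precipitation")
      .where("wsid = ?", "010010:99999")

    // Write: the connector routes rows to replica nodes for data locality.
    precip.map(row => (row.getString("wsid"), row.getDouble("precipitation")))
      .saveToCassandra("isd_weather_data", "precip_by_station",
        SomeColumns("wsid", "precipitation"))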

  • KillrWeather


    http://github.com/killrweather/killrweather

    A reference application showing how to easily integrate streaming and batch data processing with Apache Spark Streaming, Apache Cassandra, Apache Kafka and Akka for fast, streaming computations on time series data in asynchronous event-driven environments.

    http://github.com/databricks/reference-apps/tree/master/timeseries/scala/timeseries-weather/src/main/scala/com/databricks/apps/weather

  • Apache Kafka

    High-throughput distributed messaging
    Decouples data pipelines
    Handles massive data load
    Supports a massive number of consumers
    Distribution & partitioning across cluster nodes
    Automatic recovery from broker failures

  • Spark Streaming & Kafka

    val context = new StreamingContext(conf, Seconds(1))

    val wordCount = KafkaUtils.createStream(context, ...) // (key, message) pairs
      .map(_._2)             // take the message body
      .flatMap(_.split(" "))
      .map(x => (x, 1))
      .reduceByKey(_ + _)

    wordCount.saveToCassandra(ks, table)
    context.start() // start receiving and computing

  • class KafkaStreamingActor(params: Map[String, String], ssc: StreamingContext, settings: Settings)
      extends AggregationActor {

      import settings._

      val kafkaStream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
          ssc, params, Map(KafkaTopicRaw -> 1), StorageLevel.DISK_ONLY_2)
        .map(_._2.split(","))
        .map(RawWeatherData(_))

      kafkaStream.saveToCassandra(CassandraKeyspace, CassandraTableRaw)

      /** RawWeatherData: wsid, year, month, day, oneHourPrecip */
      kafkaStream.map(hour => (hour.wsid, hour.year, hour.month, hour.day, hour.oneHourPrecip))
        .saveToCassandra(CassandraKeyspace, CassandraTableDailyPrecip)

      /** Now the [[StreamingContext]] can be started. */
      context.parent ! OutputStreamInitialized

      def receive: Actor.Receive = {}
    }

    Gets the partition key for data locality: the Spark Cassandra Connector feeds this to Spark.

    The daily precip table uses a Cassandra counter column in our schema, so no expensive `reduceByKey` is needed. Simply let Cassandra do it: not expensive, and fast.

  • /** For a given weather station, calculates annual cumulative precip - or year to date. */
    class PrecipitationActor(ssc: StreamingContext, settings: WeatherSettings) extends AggregationActor {

      def receive: Actor.Receive = {
        case GetPrecipitation(wsid, year)        => cumulative(wsid, year, sender)
        case GetTopKPrecipitation(wsid, year, k) => topK(wsid, year, k, sender)
      }

      /** Computes annual aggregation. Precipitation values are 1 hour deltas from the previous. */
      def cumulative(wsid: String, year: Int, requester: ActorRef): Unit =
        ssc.cassandraTable[Double](keyspace, dailytable)
          .select("precipitation")
          .where("wsid = ? AND year = ?", wsid, year)
          .collectAsync()
          .map(AnnualPrecipitation(_, wsid, year)) pipeTo requester

      /** Returns the k highest precipitation values for a station in the `year`. */
      def topK(wsid: String, year: Int, k: Int, requester: ActorRef): Unit = {
        val toTopK = (aggregate: Seq[Double]) =>
          TopKPrecipitation(wsid, year, ssc.sparkContext.parallelize(aggregate).top(k).toSeq)

        ssc.cassandraTable[Double](keyspace, dailytable)
          .select("precipitation")
          .where("wsid = ? AND year = ?", wsid, year)
          .collectAsync().map(toTopK) pipeTo requester
      }
    }

  • A New Approach

    One runtime: streaming, scheduled
    Simplified architecture
    Allows us to:
    Write different types of applications
    Write more type-safe code
    Write more reusable code

  • Need daily analytics aggregate reports? Do it in the stream, save results in Cassandra for easy reporting as needed - with data locality not offered by S3.
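    A minimal sketch of that idea, assuming an upstream readings: DStream[(String, Double)] of (wsid, precip) pairs; the table name and window sizes are illustrative and should be multiples of the batch interval:

    import org.apache.spark.streaming.Minutes
    import com.datastax.spark.connector.streaming._

    // Aggregate in the stream itself: sliding 24-hour precipitation totals per
    // station, written straight to Cassandra for reporting - no separate batch job.
    readings
      .reduceByKeyAndWindow(_ + _, Minutes(24 * 60), Minutes(60))
      .saveToCassandra("isd_weather_data", "daily_precip_report")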

  • FiloDB: Distributed, columnar database designed to run very fast analytical queries

    Ingest streaming data from many streaming sources
    Row-level, column-level operations and built-in versioning offer greater flexibility than file-based technologies
    Currently based on Apache Cassandra & Spark (see the sketch below)

    github.com/tuplejump/FiloDB
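    For a feel of how FiloDB surfaces in Spark, a sketch along the lines of the FiloDB README (Spark 1.x style; the dataset name is the README's example, everything else here is assumed):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)

    // FiloDB datasets are exposed through the Spark Data Source API, so ad-hoc
    // analytical queries run in parallel across the Spark cluster.
    val df = sqlContext.read
      .format("filodb.spark")
      .option("dataset", "gdelt")
      .load()

    df.registerTempTable("gdelt")
    sqlContext.sql("SELECT count(*) FROM gdelt").show()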

  • FiloDB: Breakthrough performance levels for analytical queries

    Performance comparable to Parquet
    One to two orders of magnitude faster than Spark on Cassandra 2.x
    Versioned - critical for reprocessing logic / code changes
    Can simplify your infrastructure dramatically
    Queries run in parallel in Spark for scale-out ad-hoc analysis
    Space-saving techniques

  • WRAPPING UP


  • Architectyr? "This is a giant mess"

    - Going Real-time: Data Collection and Stream Processing with Apache Kafka, Jay Kreps

  • Simplified (diagram)


  • www.tuplejump.com

    [email protected] / @tuplejump

  • THANK YOU!

    @helenaedelson
    github.com/helena
    slideshare.net/helenaedelson

  • I'm speaking at QCon SF on the broader topic of Streaming at Scale

    http://qconsf.com/sf2015/track/streaming-data-scale


