Typesafe & William Hill: Cassandra, Spark, and Kafka - The New Streaming Data Troika

Date post: 11-Feb-2017
Author: datastax-academy
Dean Wampler (Typesafe), Patrick Di Loreto (William Hill) Cassandra, Spark and Kafka: The Streaming Data Troika
Transcript
  • Dean Wampler (Typesafe), Patrick Di Loreto (William Hill)

    Cassandra, Spark and Kafka:

    The Streaming Data Troika

  • 2

    About Typesafe

    Typesafe Reactive Platform: Akka, Play, and Spark, for Scala and Java.

    typesafe.com/reactive-big-data

  • 3

    What's Reactive?

    Responsive

    Elastic

    Resilient

    Message Driven

  • 4

    About William Hill

    Online Sportsbook and Gaming provider

    Every day we push more than 5 million price changes

    160TB of data flowing through our platform each day

  • We're Hiring: https://careers.williamhill.com

    WH Apple Watch App, Interactive Scoreboard, Virtual Reality Horse Race (Oculus Rift)

  • 6

    Big Data Circa 2010

  • 7

    Big Data Circa 2010

    Generally two camps. One was the offline, batch-mode processing of massive data sets done with Hadoop.

  • 8

    Big Data Circa 2010

    Akka

    The other was the online, real-time processing and storage of transactional data at scale, as exemplified by Cassandra for the data store and middleware tools and libraries like Akka, Spring, etc.

  • 9

    Big Data Circa 2010

    Akka?

    Two camps together with some overlap and connectivity, but not a lot.

  • 10

    Big Data Circa 2015

  • 11

    Big Data Circa 2015

    We still have this:

    Akka?

    Five years later (this year), we still have these architectures in wide use, but

  • 12

    Big Data Circa 2015

    But now we have this:

    Big Data Streaming

    Mesos, EC2, or Bare Metal

    A new, streaming-oriented architecture is emerging, which can also be used for batch mode analysis, if we process resident data sets as finite streams.

  • General Principles

    Spark Streaming: analytics/aggregations

    C*: storage, queries

    Kafka: durable message store; allows replay of messages lost downstream.

    Spark Streaming provides rich analytics.

    Need a durable system of record, like Kafka, which allows repeat reads in case of loss. See https://medium.com/@foundev/real-time-analytics-with-spark-streaming-and-cassandra-2f90d03342f7 for a nice summary of design patterns and tips.

  • Mesos, EC2, or Bare Metal

    14

    Let's explore this.

  • Mesos, EC2, or Bare Metal

    15

    Cassandra remains the flexible, scalable datastore, well suited to ingesting streaming data at scale, such as event streams (e.g., click streams from web apps) and logs.

  • Mesos, EC2, or Bare Metal

    16

    Kafka is growing in popularity as a tool for durable ingestion of diverse event streams, with partitioning for scale and organization into topics (like a typical message queue) for downstream consumers.

  • [Diagram labels: Service 1, Service 2, Service 3, Services, Log & Other Files, Internet, all wired directly to one another]

    N * M links between Producers and Consumers

    One use of Kafka is to solve the problem of N*M direct links between producers and consumers. Such a mesh is hard to manage, and it couples services directly to one another, which is fragile when a given service needs to be scaled up through replication or replaced. The coupling sometimes extends to the protocol that both ends need to speak.

  • [Diagram labels: Service 1, Service 2, Service 3, Services, Log & Other Files, Internet, each wired to Kafka in the middle]

    N + M links between Producers and Consumers

    So Kafka can function as a central hub, yet it's distributed and scalable, so it isn't a bottleneck or single point of failure.
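
    The difference between the two wirings is simple arithmetic. The sketch below is only an illustration; the function names and the service counts are hypothetical and do not come from the talk.

```python
def direct_links(producers: int, consumers: int) -> int:
    """Point-to-point wiring: every producer connects to every consumer."""
    return producers * consumers


def hub_links(producers: int, consumers: int) -> int:
    """Hub wiring: each producer and each consumer keeps one link to the hub."""
    return producers + consumers


# Hypothetical sizes, chosen only for illustration.
n_producers, m_consumers = 10, 8
print(direct_links(n_producers, m_consumers))  # 80
print(hub_links(n_producers, m_consumers))     # 18
```

    Adding one more consumer costs one new link with the hub, but one new link per producer without it.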

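  • Kafka Usage

    [Diagram: Topic A as an ordered sequence of messages at positions n, n+1, ..., n+5. Producer 1 and Producer 2 append to it; Consumer 1 and Consumer 2 each read from their own position.]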

    The message queue structure looks basically like this: different producers append messages to a topic, and different consumers read the messages in the queue at their own pace, in order.
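
    A minimal in-memory sketch of that structure follows. It is not the Kafka client API; `Topic`, `Consumer`, `poll`, and `rewind` are hypothetical names used only to show an append-only log where each consumer tracks its own position and can move back to re-read messages.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class Topic:
    """An append-only, ordered log of messages (hypothetical stand-in for a topic)."""
    name: str
    log: List[Any] = field(default_factory=list)

    def append(self, message: Any) -> int:
        """Append a message and return the offset it was stored at."""
        self.log.append(message)
        return len(self.log) - 1

    def read(self, offset: int, max_messages: int) -> List[Any]:
        """Return up to max_messages starting at offset, in order."""
        return self.log[offset:offset + max_messages]


@dataclass
class Consumer:
    """Tracks its own position, so consumers never interfere with each other."""
    topic: Topic
    offset: int = 0

    def poll(self, max_messages: int = 10) -> List[Any]:
        batch = self.topic.read(self.offset, max_messages)
        self.offset += len(batch)
        return batch

    def rewind(self, offset: int) -> None:
        """Move back to an earlier offset to re-read (replay) messages."""
        self.offset = offset


topic_a = Topic("A")
topic_a.append("p1-msg0")   # written by Producer 1
topic_a.append("p2-msg0")   # written by Producer 2
topic_a.append("p1-msg1")   # written by Producer 1

fast, slow = Consumer(topic_a), Consumer(topic_a)
fast_batch = fast.poll()                 # reads all three messages
slow_batch = slow.poll(max_messages=1)   # reads only the first one
fast.rewind(1)
replayed = fast.poll()                   # re-reads from offset 1 onward
```

    Because reading does not remove anything from the log, a consumer that lost data downstream can rewind and read it again.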

  • Kafka Resiliency

    Data loss downstream? Can replay lost messages.

    Could use C* for this, but then you've changed the read/write load (and hence tuning, scaling, etc. of your C* ring).

  • Mesos, EC2, or Bare Metal

    21

    The third element of the troika is Spark, the next generation, scalable compute engine that is replacing MapReduce in Hadoop. However, Spark is flexible enough to run in many cluster configurations, including a local mode for development, a simple standalone cluster mode for simple scenarios, Mesos for general scalability and flexibility, and integrated with Cassandra itself.

  • Spark Streaming Dos/Don'ts

    Do:

    Use for rich analytics and aggregations.

    Use with a Kafka/C* source if data loss is not tolerable. Or, use the write ahead log (WAL), which is less optimal.

    Spark Streaming offers rich analytics, even SQL, machine learning, and graph representations. It's a more complex engine, so there is more room for data loss. Hence, use Kafka or C* for durability and replay capabilities, but if you do ingest data directly from other sources without replay capability, at least use the WAL.

  • Spark Streaming Don'ts

    Don't:

    Use for counting (use C*).

    Use for low-latency, per-event processing.

    C* is faster and more accurate for counting, because repeat execution of Spark tasks (for error recovery, speculative execution, etc.) will cause over-counting (e.g., using the aggregator feature). Also, Spark is a mini-batch system, for processing time slices of events (down to ~1 sec.). If you need low-latency and/or per-event processing, use Akka.
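
    The over-counting mechanism can be shown without Spark at all. The sketch below is a plain Python simulation, not Spark or Cassandra code; `run_task`, `run_with_retry`, and `SharedCounter` are hypothetical names. A task that increments a shared counter as a side effect is re-executed from the start after a failure, so the events it had already handled are counted twice, while recording event identifiers in a set gives the same answer no matter how often the task is re-run.

```python
from typing import Callable, List, Optional


class SharedCounter:
    """A counter updated as a side effect of processing each event."""

    def __init__(self) -> None:
        self.value = 0

    def add(self, _event: str) -> None:
        self.value += 1


def run_task(events: List[str], on_event: Callable[[str], None],
             fail_after: Optional[int] = None) -> None:
    """Process events in order; optionally fail after `fail_after` events."""
    for index, event in enumerate(events):
        if fail_after is not None and index == fail_after:
            raise RuntimeError("simulated task failure")
        on_event(event)


def run_with_retry(events: List[str], on_event: Callable[[str], None],
                   fail_after: int) -> None:
    """Re-execute the whole task from the start if the first attempt fails."""
    try:
        run_task(events, on_event, fail_after)
    except RuntimeError:
        run_task(events, on_event)


events = ["e1", "e2", "e3", "e4"]

naive = SharedCounter()
run_with_retry(events, naive.add, fail_after=2)   # e1 and e2 are counted twice

seen = set()
run_with_retry(events, seen.add, fail_after=2)    # adding to a set is idempotent

print(naive.value)  # 6
print(len(seen))    # 4
```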

  • Mesos, EC2, or Bare Metal

    24

    Other parts of a complete infrastructure include a distributed file system like CSFv2, when you don't need a full database, e.g., for logs that you'll dump into the file system and then process in batches later on with Spark.

  • Mesos, EC2, or Bare Metal

    25

    Typesafe Reactive Platform provides infrastructure tools for integrating these and other components, including Akka Streams for resilient, low-latency event processing (based on the Reactive Streams standard for streams with dynamic back pressure), ConductR for orchestrating services, and Play for web services and consoles.

  • Typesafe Reactive Platform

    Akka Streams: low-latency, per-event processing.

    ConductR for orchestrating services.

    Play for web services, consoles.

    Commercial Spark support.

    Akka Streams implements the Reactive Streams standard for streams with dynamic back pressure. It sits on top of the more general Akka Actor framework for highly distributed concurrent applications.
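
    The core idea of dynamic back pressure is that the subscriber tells the publisher how many elements it is ready for, and the publisher never sends more than that. The sketch below is a simplified, single-threaded illustration in plain Python. It is not the Akka Streams API; the class names are hypothetical, and only the method names `on_subscribe`, `on_next`, and `request` are borrowed from the Reactive Streams interfaces.

```python
from collections import deque
from typing import Any, Iterable, List


class Publisher:
    """Emits items only while the subscriber has outstanding demand."""

    def __init__(self, items: Iterable[Any]) -> None:
        self._pending = deque(items)
        self._demand = 0
        self._subscriber = None

    def subscribe(self, subscriber: "SlowSubscriber") -> None:
        self._subscriber = subscriber
        subscriber.on_subscribe(self)

    def request(self, n: int) -> None:
        """Called by the subscriber to signal it can accept n more items."""
        self._demand += n
        while self._demand > 0 and self._pending:
            self._demand -= 1
            self._subscriber.on_next(self._pending.popleft())


class SlowSubscriber:
    """Asks for a small batch at a time, whenever it is ready for more."""

    def __init__(self, batch_size: int) -> None:
        self.batch_size = batch_size
        self.received: List[Any] = []
        self._subscription = None

    def on_subscribe(self, subscription: Publisher) -> None:
        self._subscription = subscription   # no demand signalled yet

    def on_next(self, item: Any) -> None:
        self.received.append(item)

    def request_more(self) -> None:
        self._subscription.request(self.batch_size)


publisher = Publisher(range(5))
subscriber = SlowSubscriber(batch_size=2)
publisher.subscribe(subscriber)
before_demand = list(subscriber.received)   # nothing is pushed without demand
subscriber.request_more()
after_first_request = list(subscriber.received)
```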

    Typesafe offers commercial support for development teams developing advanced Spark applications. We offer production runtime support for Spark running on Mesos clusters.

  • Mesos, EC2, or Bare Metal

    27

    Finally, there's a wealth of cluster systems possible. You could deploy these tools on the servers for your Cassandra ring, which has an excellent integration with Spark. You can run in EC2 or on bare metal. You can use a general-purpose cluster management system like Mesos.

  • Presented by Patrick Di Loreto, R&D Engineering Lead

    Site: https://developer.williamhill.com

    Twitter: https://twitter.com/patricknoir

    OMNIA

    Distributed & Reactive platform for data management

  • Motivations

    29

    Omnia: Distributed & Reactive platform for data management

    In order to be in a position to innovate we need to control and understand our data

    [Diagram labels: Users, Feeds, System, 3rd Party, Social Networks, IoT, William Hill]

    Need for control over the data

  • What is Omnia?

    30

    DMP based on the Lambda architecture and the Reactive principles

    [Diagram: Data Flow from Input to Output through Chronos (Data Source), NeoCortex (Speed Layer), Fates (Batch Layer), and Hermes (Serving Layer)]

    Lambda architecture
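
    In the Lambda architecture in general, a query is answered by combining a view precomputed by the batch layer with a view the speed layer maintains over data the batch run has not covered yet. The sketch below illustrates that idea only; it is not Omnia code, and `build_view`, `serve`, and the sample incidents are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable

Incident = Dict[str, str]


def build_view(incidents: Iterable[Incident]) -> Counter:
    """Count incidents per type; used for both the batch and the speed view."""
    return Counter(incident["type"] for incident in incidents)


def serve(batch_view: Counter, speed_view: Counter, incident_type: str) -> int:
    """Serving-layer query: merge the batch view with the speed view."""
    return batch_view[incident_type] + speed_view[incident_type]


# Hypothetical data: everything already covered by the last batch run...
master_dataset = [{"type": "bet"}, {"type": "deposit"}, {"type": "bet"}]
# ...and what has arrived since then.
recent = [{"type": "bet"}]

batch_view = build_view(master_dataset)   # recomputed periodically
speed_view = build_view(recent)           # updated as data arrives

print(serve(batch_view, speed_view, "bet"))      # 3
print(serve(batch_view, speed_view, "deposit"))  # 1
```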

  • Reactive principles

    31

    Responsive

    Resilient

    Message Driven

    Elastic

    The Reactive Manifesto http://www.reactivemanifesto.org/


  • Chronos is a reliable and scalable component that collects data from different sources and organizes it into Streams of observable events.

    Chronos: Data acquisition

    32

    Incident: {
      "type": "bet",
      "version": "1.0",
      "time": "2015-09-03 06:00:10",
      "acquisitionTime": "2015-09-03 06:00:06",
      "source": "BetSystem",
      "payload": { ... any valid JSON ... }
    }

    [Diagram: Data Source into Chronos. An Adapter (TCP, HTTP, WS, JMS, HTTP Poll, SSE), a Converter, and Persistence produce Streams such as Bets, Deposits, and Prices.]

    Stream = Adapter + Converter + Persistence
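
    One way to read "Stream = Adapter + Converter + Persistence" is as a composition of three pluggable parts. The sketch below is an assumption about how such a composition could look, not Chronos code: `run_stream` and `bet_converter` are hypothetical, the raw field names `placedAt` and `receivedAt` are invented for the example, and a Python list stands in for the persistence layer. The `Incident` fields mirror the ones shown on the slide.

```python
import json
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterable, List


@dataclass(frozen=True)
class Incident:
    """Common envelope for every observed event (fields as on the slide)."""
    type: str
    version: str
    time: str
    acquisition_time: str
    source: str
    payload: Dict[str, Any]


def run_stream(adapter: Iterable[str],
               converter: Callable[[str], Incident],
               persist: Callable[[Incident], None]) -> int:
    """Pull raw messages from the adapter, convert them, and persist them."""
    handled = 0
    for raw in adapter:
        persist(converter(raw))
        handled += 1
    return handled


def bet_converter(raw: str) -> Incident:
    """Hypothetical converter for a bet feed; the raw field names are invented."""
    data = json.loads(raw)
    return Incident(
        type="bet",
        version="1.0",
        time=data["placedAt"],
        acquisition_time=data["receivedAt"],
        source="BetSystem",
        payload=data,
    )


# A list stands in for the adapter (e.g. an SSE or JMS source) and for storage.
raw_feed = ['{"placedAt": "2015-09-03 06:00:10", '
            '"receivedAt": "2015-09-03 06:00:06", "stake": 5}']
store: List[Incident] = []
handled = run_stream(raw_feed, bet_converter, store.append)
```

    Swapping the adapter or the converter gives a different stream without touching the other two parts.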

  • Chronos: Data acquisition

    33

    Chronos 1 (SSE, Bets placed)

    Chronos 2 (JMS, Deposits)

    Chronos 3 (HTTP, Events)

    ...

    Chronos N (SSE, Twitter)

  • Chronos: Why Kafka

    Kafka is a high-throughput distributed messaging system

    Design Principles:

    Highly Available: Replicated, Distributed

    High throughput: Stateless Broker

    Efficiency:

    Disk efficiency: don't fear the file system; modern OSs optimize sequential disk operations and disk caching

    Usage of the OS filesystem cache rather than an application-level cache:

    More efficient (no GC overhead)

    Survives application restarts

    I/O efficiency: batching reduces small I/O operations, which amortizes network round-trip overhead and enables larger sequential disk operations

    Durable
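
    The batching point can be illustrated by counting write calls. This is a plain Python sketch, not Kafka producer code; `send_one_by_one`, `send_batched`, and `CountingSink` are hypothetical, and each call to the sink stands in for one network round trip or disk write.

```python
from typing import Callable, List, Sequence


class CountingSink:
    """Stands in for a network or disk write; counts how often it is called."""

    def __init__(self) -> None:
        self.writes = 0
        self.messages: List[str] = []

    def __call__(self, batch: Sequence[str]) -> None:
        self.writes += 1
        self.messages.extend(batch)


def send_one_by_one(messages: Sequence[str],
                    write: Callable[[Sequence[str]], None]) -> None:
    """One write per message: many small I/O operations."""
    for message in messages:
        write([message])


def send_batched(messages: Sequence[str],
                 write: Callable[[Sequence[str]], None],
                 batch_size: int) -> None:
    """Group messages so each write carries up to batch_size of them."""
    for start in range(0, len(messages), batch_size):
        write(list(messages[start:start + batch_size]))


messages = [f"msg-{i}" for i in range(10)]
unbatched, batched = CountingSink(), CountingSink()
send_one_by_one(messages, unbatched)
send_batched(messages, batched, batch_size=4)

print(unbatched.writes)  # 10
print(batched.writes)    # 3
```

    Both sinks end up with the same messages in the same order; only the number of operations differs.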

  • Fates represents the long term memory of Omnia. It organizes the incidents that Chronos collected into timelines and also elaborates new information as views by using machine learning, logical reasoning and time series analysis.

    Fates: Batch layer

    35

    [Diagram: Fates (Batch Layer) turns the Bets, Deposits, Events, and Session streams into Timelines & Views]

    Timeline for Customer 123: Login, Deposit, Bet placed, Logout

    Timeline for Event 78: Started, Fault, Penalty, Goal
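
    Organizing incidents into timelines amounts to grouping them by an entity and ordering each group by time. The sketch below shows that step under simple assumptions; it is not Fates code, and `build_timelines`, the dictionary keys, and the sample incidents are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

Incident = Dict[str, str]


def build_timelines(incidents: Iterable[Incident],
                    key: str) -> Dict[str, List[Incident]]:
    """Group incidents by `key` and sort each group chronologically."""
    timelines: Dict[str, List[Incident]] = defaultdict(list)
    for incident in incidents:
        timelines[incident[key]].append(incident)
    for timeline in timelines.values():
        # Timestamps in this fixed format sort correctly as strings.
        timeline.sort(key=lambda incident: incident["time"])
    return dict(timelines)


# Hypothetical incidents, arriving out of order and from two customers.
incidents = [
    {"customer": "123", "time": "2015-09-03 06:05:00", "type": "Bet placed"},
    {"customer": "123", "time": "2015-09-03 06:00:00", "type": "Login"},
    {"customer": "456", "time": "2015-09-03 06:01:00", "type": "Login"},
    {"customer": "123", "time": "2015-09-03 06:02:00", "type": "Deposit"},
    {"customer": "123", "time": "2015-09-03 06:09:00", "type": "Logout"},
]

timelines = build_timelines(incidents, key="customer")
steps = [incident["type"] for incident in timelines["123"]]
print(steps)  # ['Login', 'Deposit', 'Bet placed', 'Logout']
```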

  • Fates: Batch layer

    36

    [Diagram labels: Timelines, Views, Jobs, Fates]

  • Fates: Cassandra

    Cassandra is the long term storage for our data.

    Highly Available (CAP)

    Linear Scalability

    Multi DC

    Separation of Concerns (Production and Analytic DCs)

    High performance and optimal for WRITE operations

  • NeoCortex represents the short term memory of Omnia. It offers a framework to develop micro services on top of Apache Spark. It performs fast and real time data processing with the data acquired from Chronos and Fates.

    NeoCortex: Speed layer

    38

    [Diagram: the Bets, Deposits, Events, and Session streams feed Micro Services in NeoCortex, which produce Output]

  • Hermes provides scalable, full-duplex communication for B2C and B2B.

    Hermes: Serving Layer

    39

    [Diagram labels: B2C (Browser, JS API, WH Apps), B2B (Apps), TCP, WS, HTTP, Load balancer, Push Servers, Distributed Cache]

  • Omnia Data Flow

    40

    Custom advert, bonus, data load prediction, bot detection...

    [Diagram: Input to Output through Chronos (Data Source), NeoCortex (Speed Layer), Fates (Batch Layer), and Hermes (Serving Layer)]

    Users become a new data producer

  • Omnia on Omnia

    41

    Real time monitoring and elasticity

    Docker and Mesos: scale in and out based on demand

    [Diagram: Input to Output through Chronos (Data Source), NeoCortex (Speed Layer), Fates (Batch Layer), and Hermes (Serving Layer), annotated with JMX]
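
    Scaling in and out based on demand comes down to turning a monitored load figure into a target instance count. The sketch below is a generic illustration, not Omnia, Marathon, or Mesos code; `desired_instances`, its parameters, and the numbers are hypothetical.

```python
import math


def desired_instances(load: float, capacity_per_instance: float,
                      minimum: int = 1, maximum: int = 20) -> int:
    """Return how many instances are needed for `load`, within fixed bounds."""
    if capacity_per_instance <= 0:
        raise ValueError("capacity_per_instance must be positive")
    needed = math.ceil(load / capacity_per_instance)
    return max(minimum, min(maximum, needed))


# Hypothetical figures: messages per second observed vs. handled per instance.
print(desired_instances(load=950, capacity_per_instance=200))     # 5
print(desired_instances(load=0, capacity_per_instance=200))       # 1
print(desired_instances(load=10_000, capacity_per_instance=200))  # 20
```

    The lower bound keeps at least one instance running when the load drops to zero, and the upper bound caps how far a spike can scale the service out.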

  • Omnia infrastructure

    42

    [Diagram: layered stack, top to bottom: Omnia, Docker, Marathon, Mesos, five Nodes]

  • Thank you

    careers.williamhillplc.com

    omnia.williamhill.com/

    typesafe.com/reactive-big-data

