BASEL BERN BRUGG DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. GENEVA HAMBURG COPENHAGEN LAUSANNE MUNICH STUTTGART VIENNA ZURICH
Introduction to Streaming Analytics
Guido Schmutz

• Working for Trivadis for more than 19 years
• Oracle ACE Director for Fusion Middleware and SOA
• Co-author of different books
• Consultant, trainer and software architect for Java, Oracle, SOA and Big Data / Fast Data
• Member of the Trivadis Architecture Board
• Technology Manager @ Trivadis
• More than 25 years of software development experience

Contact: [email protected]
Blog: http://guidoschmutz.wordpress.com
Twitter: gschmutz
Our company.
© Trivadis – The Company (03.06.16)

Trivadis is a market leader in IT consulting, system integration, solution engineering and the provision of IT services, focusing on … and Open Source technologies, in Switzerland, Germany, Austria and Denmark. We offer our services in the following strategic business fields:

Trivadis Services takes over the operation of your IT systems (OPERATION).
With over 600 specialists and IT experts in your region.

(Map: 14 Trivadis branches in Copenhagen, Hamburg, Düsseldorf, Frankfurt, Stuttgart, Freiburg, Basel, Brugg, Zurich, Bern, Lausanne, Geneva, Munich and Vienna)

14 Trivadis branches and more than 600 employees
200 Service Level Agreements
Over 4,000 training participants
Research and development budget: CHF 5.0 million
Financially self-supporting and sustainably profitable
Experience from more than 1,900 projects per year at over 800 customers
Agenda
1. Introduction & Foundation
2. Designing Streaming Analytics Solutions
3. Implementing Event Hub
4. Implementing Data Ingestion
5. Implementing Streaming Analytics
6. Scalability & Reliability
7. Streaming Analytics in Architecture
8. Summary
Introduction & Foundation
Big Data Definition (4 Vs)
+ Time to action? – Big Data + Real-Time = Stream Processing
Characteristics of Big Data: its Volume, Velocity and Variety in combination
The world is changing …
The model of Generating/Consuming Data has changed ….
Old Model: few companies are generating data, all others are consuming data
New Model: all of us are generating data, and all of us are consuming data
Who is generating Big Data?
The progress and innovation is no longer hindered by the ability to collect data
But by the ability to manage, analyze, summarize, visualize and discover knowledge from the collected data in a timely manner and in a scalable fashion
• Social media and networks (all of us are generating data)
• Scientific instruments (collecting all sorts of data)
• Mobile devices (tracking all objects all the time)
• Sensor technology and networks (measuring all kinds of data)
Traditional Data Processing - Challenges
• Introduces too much "decision latency"
• Responses are delivered "after the fact"
• Maximum value of the identified situation is lost
• Decisions are made on old and stale data
• "Data at Rest"
The New Era: Streaming Data Analytics / Fast Data
• Events are analyzed and processed in real-time as they arrive
• Decisions are timely, contextual and based on fresh data
• Decision latency is eliminated
• “Data in motion”
Real Time Analytics Use Cases
• Algorithmic Trading
• Online Fraud Detection
• Geo Fencing
• Proximity/Location Tracking
• Intrusion detection systems
• Traffic Management
• Recommendations
• Churn detection
• Internet of Things (IoT) / Intelligence
Sensors
• Social Media/Data Analytics
• Gaming Data Feed
• …
What happens in an internet minute

Internet of Things – Sensors are/will be everywhere
There are more devices tapping into the internet than people on earth

How do we prepare our systems/architecture for the future?

Source: Cisco / The Economist
Different Types of Stream/Event Processing
Simple Event Processing (SEP)
Event Stream Processing (ESP)
Different Types of Stream/Event Processing
Complex Event Processing (CEP)
Native Streaming vs. Micro-Batching
Native Streaming
• Events processed as they arrive
• + low latency
• - throughput
• - fault tolerance is expensive

Micro-Batching
• Splits the incoming stream into small batches
• + high(er) throughput
• + easier fault tolerance
• - higher latency

Source: "Distributed Real-Time Stream Processing: Why and How" by Petr Zapletal
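The micro-batching model can be illustrated with a small sketch (plain Python, illustrative only): incoming timestamped events are grouped into fixed time buckets, one batch per interval, instead of being handled one at a time.

```python
def micro_batches(events, interval):
    """Group (timestamp, value) events into fixed-size time buckets,
    emitting one batch per interval -- the micro-batching model."""
    batches = {}
    for ts, value in events:
        bucket = int(ts // interval)      # events in the same interval share a bucket
        batches.setdefault(bucket, []).append(value)
    return [batches[b] for b in sorted(batches)]

events = [(0.1, "a"), (0.4, "b"), (1.2, "c"), (2.7, "d"), (2.9, "e")]
print(micro_batches(events, 1.0))   # [['a', 'b'], ['c'], ['d', 'e']]
```

A native streaming engine would instead invoke the processing function once per event, which is what buys the lower latency at the cost of per-event overhead.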
How to design a Streaming Analytics Solution?
(Diagram 1: Event Stream → event → Data Ingestion → event → Persist (Queue))
(Diagram 2: Event Stream → event → Data Ingestion → event → Analytics → result)
(Diagram 3: Event Stream → event → combined Data Ingestion/Analytics → result)
Demo Use Case – Truck Sensors
(Pipeline: Truck → Data Ingestion → Movement / Movement JSON → Geo-Fencing (NEAR, ENTER) and Reckless Driving Detector → Truck Driver / Reckless Driver → Dashboard)

Sample CSV record:
2016-06-02 14:39:56.605|98|27|Mark Lochbihler|803014426|Wichita to Little Rock Route 2|Normal|38.65|-90.21|5187297736652502631

Sample JSON record:
{"timestamp": "2016-06-02 14:39:56.991", "truckId": 99, "driverId": 31, "driverName": "Rommel Garcia", "routeId": 1565885487, "routeName": "Springfield to KC Via Hanibal", "eventType": "Normal", "latitude": 37.16, "longitude": "-94.46", "correlationId": 5187297736652502631}
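As a sketch of what the Geo-Fencing step conceptually does (plain Python; the circular fence, the NEAR margin and all function names are illustrative assumptions, not the demo's actual implementation):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points in kilometers."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def classify(lat, lon, fence_lat, fence_lon, radius_km, near_margin_km=5.0):
    """Classify a truck position against a circular geo-fence:
    ENTER when inside, NEAR within a margin outside, OUTSIDE otherwise."""
    d = haversine_km(lat, lon, fence_lat, fence_lon)
    if d <= radius_km:
        return "ENTER"
    if d <= radius_km + near_margin_km:
        return "NEAR"
    return "OUTSIDE"
```

Real geo-fences are usually polygons rather than circles, but the event classification (ENTER / NEAR) works the same way.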
Designing Streaming Analytics Solutions
How to design a Streaming Analytics System?
It usually starts very simple … just one data pipeline

(Diagram: Event Stream → event → Data Ingestion → Analytics)
New Event Stream sources are added …

(Diagram: Event Stream, 2nd, 3rd, … nth Event Stream → Data Ingestion, 2nd, 3rd, … nth Data Ingestion → Analytics)

New Processors are interested in the events …

(Diagram: the same event streams and ingestion pipelines now also feed a 2nd Analytics component)
… and the solution becomes the problem

(Diagram: n event streams, n data ingestion pipelines and n analytics components (2nd, 3rd, … nth Analytics) form a tangled point-to-point mesh)
… and the solution becomes the problem

(Diagram: New Customers, Operational Logs, Click Stream and Meter Readings feed CDC, Log, Click Stream and Sensor Ingestion, which in turn feed Hadoop/Data Warehouse, a Recommendation System, Log Search and Fraud Detection, all point-to-point)
Decouple event streams from consumers: the "Unified Log"

Remember the Enterprise Service Bus (ESB)?

(Diagram: New Customers, Operational Logs, Click Stream and Meter Readings → CDC, Log, Click Stream and Sensor Ingestion → Enterprise Event Bus → Hadoop/Data Warehouse, Recommendation System, Log Search, Fraud Detection)

What is the idea of a Unified Log?
Unified Log – What is it?
By Unified Log, we do not mean this …

137.229.78.245 - - [02/Jul/2012:13:22:26 -0800] "GET /wp-includes/js/tinymce/wp-tinymce.php?c=1&ver=349-20805 HTTP/1.1" 200 101114
137.229.78.245 - - [02/Jul/2012:13:22:28 -0800] "POST /wp-admin/admin-ajax.php HTTP/1.1" 200 30747
137.229.78.245 - - [02/Jul/2012:13:22:40 -0800] "POST /wp-admin/post.php HTTP/1.1" 302 -
137.229.78.245 - - [02/Jul/2012:13:22:40 -0800] "GET /wp-admin/post.php?post=387&action=edit&message=1 HTTP/1.1" 200 73160
137.229.78.245 - - [02/Jul/2012:13:22:41 -0800] "GET /wp-includes/css/editor.css?ver=3.4.1 HTTP/1.1" 304 -
137.229.78.245 - - [02/Jul/2012:13:22:41 -0800] "GET /wp-includes/js/tinymce/langs/wp-langs-en.js?ver=349-20805 HTTP/1.1" 304 -
137.229.78.245 - - [02/Jul/2012:13:22:41 -0800] "POST /wp-admin/admin-ajax.php HTTP/1.1" 200 30809
… but this:
• a structured log (records are numbered beginning with 0, based on the order they are written)
• a.k.a. commit log or journal

(Diagram: records 0 … 11 appended in order; the 1st record sits at offset 0, the next record is written at the end)
Central Unified Log for (real-time) subscription
Take all the organization's data (events) and put it into a central log for subscription.

Properties of the Unified Log:
• Unified: “Enterprise”, single deployment
• Append-Only: events are appended, no update in place => immutable
• Ordered: each event has an offset, which is unique within a shard
• Fast: should be able to handle thousands of messages / sec
• Distributed: lives on a cluster of machines
(Diagram: a Collector writes records 0 … 11 to the log; Consumer System A reads at offset 6, Consumer System B reads at offset 10)
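The properties above can be sketched in a few lines (illustrative only, not a real broker): an append-only list of records, with each consumer tracking its own read offset independently.

```python
class UnifiedLog:
    """Minimal unified-log sketch: append-only records, per-consumer offsets."""

    def __init__(self):
        self._records = []            # append-only; records are never updated
        self._offsets = {}            # consumer name -> next offset to read

    def append(self, record):
        self._records.append(record)  # the offset is the position in the list
        return len(self._records) - 1

    def read(self, consumer, max_records=10):
        start = self._offsets.get(consumer, 0)
        batch = self._records[start:start + max_records]
        self._offsets[consumer] = start + len(batch)   # each consumer advances independently
        return batch

log = UnifiedLog()
for e in ["e0", "e1", "e2"]:
    log.append(e)
print(log.read("A", 2))   # ['e0', 'e1'] -- consumer A is now at offset 2
print(log.read("B"))      # ['e0', 'e1', 'e2'] -- consumer B reads independently
```

Because consumers only move a pointer, a slow consumer never blocks the producer or the other consumers.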
Implementing Event Bus
Apache Kafka - Overview
Distributed publish-subscribe messaging system
Designed for processing of real time activity stream data (logs, metrics collections, social media streams, …)
Initially developed at LinkedIn, now part of Apache
Does not use JMS API and standards
Kafka maintains feeds of messages in topics
(Diagram: Producers publish to a Kafka Cluster; Consumers subscribe to it)
Apache Kafka - Motivation
LinkedIn’s motivation for Kafka was:
• “A unified platform for handling all the real-time data feeds a large company might have.”
Must haves
• High throughput to support high volume event feeds.
• Support real-time processing of these feeds to create new, derived feeds.
• Support large data backlogs to handle periodic ingestion from offline systems.
• Support low-latency delivery to handle more traditional messaging use cases.
• Guarantee fault-tolerance in the presence of machine failures.
Apache Kafka - Architecture
(Diagram 1: a Truck produces to a Kafka Broker holding a Movement Topic and an Engine-Metrics Topic (messages at offsets 1 … 6); a Movement Processor and an Engine Processor consume them)

(Diagram 2: the Movement Topic now has two partitions, Partition 0 and Partition 1; two Movement Processor instances consume in parallel)

(Diagram 3: the topics and their partitions (P0, P1) are spread across two Kafka Brokers)
Apache Kafka - Partition offsets
Offset: messages in the partitions are each assigned a unique (per partition) and sequential id called the offset
• Consumers track their pointers via (offset, partition, topic) tuples
(Diagram: consumer group C1 tracks its offset per partition)
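How keyed records map to partitions can be sketched as follows (illustrative; Kafka's Java client hashes keys with murmur2, the stdlib crc32 merely stands in here). The point is that equal keys always land in the same partition, which preserves per-key ordering.

```python
import zlib

def partition_for(key, num_partitions):
    """Kafka-style keyed partitioning sketch: hash of the key, modulo the
    partition count (Kafka's Java client actually uses murmur2)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

topic = [[] for _ in range(3)]        # a topic with 3 partitions
for truck_id, event in [("98", "m1"), ("99", "m2"), ("98", "m3")]:
    topic[partition_for(truck_id, 3)].append((truck_id, event))

# All events for truck 98 sit in one partition, in arrival order:
p98 = topic[partition_for("98", 3)]
print([e for k, e in p98 if k == "98"])   # ['m1', 'm3']
```

This is why choosing a good partitioning key (here the truck id) matters: ordering guarantees only hold within a partition.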
Apache Kafka - Performance
Kafka at LinkedIn => over 1100 brokers / 60 clusters
• 800 billion messages/day
• 175 TB produced/day
• 650 TB consumed/day
• 13 million messages/second, 2.75 GB/second at the busiest time of day

Kafka performance on our own setup => 6 brokers (VM) / 1 cluster
• 445,622 messages/second
• 31 MB/second
• 3.0405 ms average latency between producer and consumer

http://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines
https://engineering.linkedin.com/kafka/running-kafka-scale
Demo Use Case – Truck Sensors
Demo: Consuming Kafka Topic
Demo: Monitoring Kafka Cluster with Kafka Manager
Implementing Data Ingestion
StreamSets Data Collector
• Founded by ex-Cloudera and ex-Informatica employees
• Continuous open source, intent-driven, big data ingest
• Visible, record-oriented approach fixes combinatorial explosion
• Batch or stream processing
• Standalone, Spark cluster or MapReduce cluster
• IDE for pipeline development by "civilians"
• Relatively new: first public release September 2015
• So far, the vast majority of commits are from StreamSets staff
Apache NiFi
• Originated at NSA as Niagarafiles
• Open sourced December 2014, Apache TLP July 2015
• Opaque, file-oriented payload
• Distributed system of processors with centralized control
• Based on flow-based programming concepts
• Data Provenance
• Web-based user interface
Demo Use Case – Truck Sensors
Demo: Using Apache NiFi for Collection
Implementing Streaming Analytics
Streaming Analytics
(Positioning: streaming analytics offerings range from products to frameworks/infrastructure, and from open source to closed source)
Implementing Streaming Analytics: Oracle Stream Analytics
History of Oracle Stream Analytics
(Timeline, 2007-2016: BEA WebLogic Event Server with Oracle CQL → Oracle Complex Event Processing (OCEP) → Oracle Event Processing (OEP) → Oracle Event Processing for Java Embedded → Oracle Stream Explorer (SX) → Oracle Stream Analytics (OSA); related products: Oracle Edge Analytics (OEA) and the Oracle IoT Cloud Service)
OEA performs filtering, correlation, aggregation and pattern matching close to the devices/gateways (Edge/FOG analytics), turning the "sea of data" at the computing edge into high-value, actionable, in-context macro-events for the enterprise services.

Stream Analytics characteristics:
• High volume
• Continuous streaming
• Extreme low latency
• Disparate sources
• Temporal processing
• Pattern matching
• Machine learning
Oracle Stream Analytics: From Noise to Value
• High volume
• Continuous streaming
• Sub-millisecond latency
• Disparate sources
• Time-window processing
• Pattern matching
• High availability / scalability
• Coherence integration
• Geospatial, geofencing
• Big Data integration
• Business event visualization
• Action!
Oracle Stream Analytics Platform

What it does
• A compelling, friendly and visually stunning real-time streaming analytics user experience for business users to dynamically create and implement Instant Insight solutions

Key features
• Analyze simulated or live data feeds to determine event patterns, correlation, aggregation & filtering
• Pattern library for industry-specific solutions
• Streams, References, Maps & Explorations

Benefits
• Accelerated delivery time
• Hides all challenges & complexities of the underlying real-time event-driven infrastructure
Oracle Stream Analytics - Connecting Everything & Anything of Interest to the Business
Understanding of CQL Filtering, Correlation, Pattern: NOT NEEDED
Understanding of IT Deployment and Management: NOT NEEDED
Understanding of Development, Java, Best Practices: NOT NEEDED
Understanding of the Event Driven Platform: NOT NEEDED
Business accessibility to Geo-Streaming Analytics
Real Time Streaming Solutions face an increasing need to track "assets of interest" and initiate actions based on encroachment of boundary proximity to fixed and moving objects and other geographic, temporal, or event conditions.
Geo-Fence, Fence, Polygon (Geo-Streaming)

"Add value to your real-time streaming data discovery and analytics by applying and including mathematical and statistical analysis to the live output stream"

"These streaming 'Excel spreadsheets' really do come to life"
Expression Builder enabling calculation for the Business User
Concept of Connections & Connection Reuse in Streams
Decision Table for Nested IF-THEN-ELSE Rules
Topology View and Navigation
Stream Analytics – Terminology for Business Users
Explorer: the application user interface
Catalog: the repository for browsing resources
Stream: incoming flow of events that you want to analyze (CSV, Kafka, JMS, Rest, MQTT, …)
Exploration: application that correlates events from streams and data sources, using filters, groupings, summaries, ranges, and more
Shape: A blueprint of an event in a stream or data in a data source. How the business data is represented in the selected stream
Map: collection of geo-fences
Reference: A connection to static data that is joined to a stream to enrich it and/or to be used in business logic and output
Pattern: A pre-built Exploration that addresses a particular business scenario in a focused and simplified User Interface
Connection: collection of metadata required to connect to an external system
Targets: defines an interface with a downstream system
Demo Use Case – Truck Sensors
Demo: Oracle Stream Analytics
Implementing Streaming Analytics: Spark Streaming
Apache Spark
Apache Spark is a fast and general engine for large-scale data processing
• The hot trend in Big Data!
• Originally developed in 2009 in UC Berkeley's AMPLab
• Based on the 2007 Microsoft Dryad paper
• Written in Scala; supports Java, Python, SQL and R
• Can run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk
• One of the largest OSS communities in big data, with over 200 contributors in 50+ organizations
• Open sourced in 2010; part of the Apache Software Foundation since 2014
Apache Spark
(Stack diagram: Libraries: Spark SQL (batch processing), BlinkDB (approximate querying), Spark Streaming (real-time), MLlib / SparkR (machine learning), GraphX (graph processing); Core Runtime: Spark Core API and execution model; Cluster Resource Managers: Spark Standalone, Mesos, YARN; Data Stores: HDFS, Elasticsearch, NoSQL, S3)
Resilient Distributed Dataset (RDD)

RDDs are
• Immutable
• Re-computable
• Fault tolerant
• Reusable

RDDs have transformations
• Produce a new RDD
• Rich set of transformations available: filter(), flatMap(), map(), distinct(), groupBy(), union(), join(), sortByKey(), reduceByKey(), subtract(), ...

RDDs have actions
• Start cluster computing operations
• Rich set of actions available: collect(), count(), fold(), reduce(), ...

(Diagram: an input source (file, database, stream, collection) becomes an RDD; transformations produce new RDDs; an action such as .count() -> 100 triggers the computation)
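The lazy transformation/action split can be mimicked in plain Python (an illustrative sketch, not Spark's API): transformations only record a lineage, and an action walks that lineage to compute the result.

```python
class FakeRDD:
    """Toy RDD sketch: transformations are lazy, actions execute the lineage."""

    def __init__(self, data, ops=()):
        self._data, self._ops = data, ops

    def map(self, f):        # transformation: returns a new "RDD", computes nothing
        return FakeRDD(self._data, self._ops + (("map", f),))

    def filter(self, p):     # transformation: also lazy
        return FakeRDD(self._data, self._ops + (("filter", p),))

    def collect(self):       # action: only now is the recorded lineage executed
        out = list(self._data)
        for kind, f in self._ops:
            out = [f(x) for x in out] if kind == "map" else [x for x in out if f(x)]
        return out

rdd = FakeRDD(range(5)).map(lambda x: x * 2).filter(lambda x: x > 4)
print(rdd.collect())   # [6, 8]
```

Laziness is what lets Spark's DAGScheduler see the whole lineage before running anything, and it is also what makes lost partitions re-computable.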
(Diagram, shown in three steps: the data of one RDD is split into Partitions 0 … 9, which are distributed across Servers 1 … 5; when a server fails, its partitions are recomputed on the remaining servers)
Spark Workflow

(Diagram: an input HDFS file is read with sc.hadoopFile() into a HadoopRDD; flatMap() and map() produce MappedRDDs (Stage 1), reduceByKey() produces a ShuffledRDD (Stage 2), and sc.saveAsTextFile() writes the text file output. Transformations are lazy; the action triggers the DAGScheduler on the master to execute the transformations.)
Spark Execution Model

(Diagram: the master coordinates workers; each worker runs executors that process RDD partitions (P0, P1, P3) against data storage)

Stage 1, narrow transformations (filter(), map(), sample(), flatMap()): each output partition depends on a single input partition, so executors process partitions independently.

Stage 2, wide transformations (join(), reduceByKey(), union(), groupByKey()): output partitions depend on many input partitions, which forces a shuffle across workers.
Batch vs. Real-Time Processing: petabytes of data vs. gigabytes per second
Discretized Stream (DStream)

(Diagram: trucks send individual events to Kafka; Spark Streaming discretizes them by time, so a DStream is a sequence of RDDs, one per interval of X seconds)

Applying a transformation (.countByValue(), .reduceByKey(), .join, .map, …) to each RDD of a DStream produces a new DStream.
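Conceptually (illustrative plain-Python sketch, not the Spark Streaming API), a DStream is just a sequence of per-interval batches, and a DStream transformation is the same function applied to every batch:

```python
from collections import Counter

def transform_dstream(dstream, f):
    """Apply f to each batch (RDD) of the stream -> a new stream of batches."""
    return [f(batch) for batch in dstream]

# Three micro-batches of truck event types, one batch per interval:
dstream = [["Normal", "Speeding"], ["Normal"], ["Speeding", "Speeding"]]
counts = transform_dstream(dstream, Counter)   # like countByValue() per batch
print(counts[2]["Speeding"])   # 2
```

Stateful operations (e.g. running totals across intervals) additionally carry state from one batch to the next, which the real API supports via updateStateByKey-style operators.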
(Diagram, adapted from Chris Fregly, http://slidesha.re/11PP7FV: at each interval time1, time2, time3, …, timen the messages 1 … n received in that interval form an RDD@time; applying f() to each message yields the per-interval result RDDs. The input stream becomes an Event DStream; map() produces a Mapped DStream, building up the transformation lineage with time increasing; actions such as saveAsHadoopFiles() trigger Spark jobs.)
Demo Use Case – Truck Sensors
Implementing Streaming Analytics: Apache Storm
Apache Storm
A platform for doing analysis on streams of data as they come in, so you can react to data as it happens
• Highly distributed real-time computation system
• Provides general primitives to do real-time computation
• Simplifies working with queues & workers
• Scalable and fault-tolerant

Originated at BackType, acquired by Twitter in 2011
Open sourced late 2011; part of Apache since September 2013
Apache Storm – Core concepts
Tuple
• Immutable set of key/value pairs

Stream
• An unbounded sequence of tuples that can be processed in parallel by Storm

Topology
• Wires data and functions via a DAG (directed acyclic graph)
• Executes on many machines, similar to a MapReduce job in Hadoop

Spout
• Source of data streams (tuples)
• Can run in "reliable" and "unreliable" mode

Bolt
• Consumes 1+ streams and produces new streams
• Complex operations often require multiple steps and thus multiple bolts

(Diagram: two spouts emit tuple streams A and B; bolts subscribe to A (emitting C), A (emitting D), A & B, and C & D)
Demo Use Case – Truck Sensors
Apache Storm – How does it work?

Step 1: a Trucks Movement spout emits Truck Movement tuples such as
{"timestamp": "2016-06-02 12:56:02.362", "truckId": 35, "driverId": 26, "driverName": "Michael Aube", "routeId": 1090292248, "eventType": "Normal", "latitude": 40.86, "longitude": "-89.91"}
Shuffle grouping distributes them across the GeoHashing bolt instances, which add a geohash field, e.g. "geohash": "dp206n3d".

Step 2: fields grouping on the geohash routes each enriched tuple ({"geohash": "dp206n3d", …}, {"geohash": "f00hfh99", …}) to the GeoFencer bolt instance responsible for that geohash.

Step 3: global grouping sends all GeoFencer results, e.g. {"timestamp": "2016-06-02 12:56:02.362", "truckId": 35, "driverId": 26, "latitude": 40.86, "longitude": "-89.91"}, to a single Alerter bolt.
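The GeoHashing bolt's core operation, computing a geohash from latitude/longitude, follows the standard geohash algorithm: alternately bisect the longitude and latitude ranges, then map each group of 5 bits to a base32 character. A minimal encoder:

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"   # the geohash base32 alphabet

def geohash(lat, lon, precision=8):
    """Encode a lat/lon pair as a geohash string of the given length."""
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    result, bits, bit_count, even = "", 0, 0, True
    while len(result) < precision:
        rng, val = (lon_range, lon) if even else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        bits <<= 1
        if val >= mid:
            bits |= 1
            rng[0] = mid              # keep the upper half of the range
        else:
            rng[1] = mid              # keep the lower half of the range
        even, bit_count = not even, bit_count + 1
        if bit_count == 5:            # 5 bits -> one base32 character
            result += _BASE32[bits]
            bits, bit_count = 0, 0
    return result

print(geohash(57.64911, 10.40744))   # 'u4pruydq' (the classic example point)
```

Nearby positions share geohash prefixes, which is exactly why it works as a fields-grouping key: all movements in the same cell reach the same GeoFencer task.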
Apache Storm – Core concepts

Each spout or bolt runs as n instances in parallel.

(Diagram: Trucks Movement spout → shuffle grouping → GeoHashing instances 1 … n → fields grouping → GeoFencing instances 1 … n → global grouping → Report)

• Shuffle grouping: random distribution of tuples across tasks
• Fields grouping: grouped by value, such that equal values go to the same task
• All grouping: replicates to all tasks
• Global grouping: all tuples go to one task
• None grouping: the bolt runs in the same thread as the bolt/spout it subscribes to
• Direct grouping: the producer (the task that emits) controls which consumer will receive
• Local or shuffle grouping: like shuffle grouping, but shuffles tuples among bolt tasks running in the same worker process, if any; falls back to shuffle grouping behavior
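Shuffle and fields grouping can be sketched like this (plain Python, illustrative only; Storm's actual task assignment differs in detail, and round-robin stands in for the random shuffle):

```python
import zlib

def shuffle_grouping(tuples, num_tasks):
    """Spread tuples evenly across tasks, ignoring their content."""
    tasks = [[] for _ in range(num_tasks)]
    for i, t in enumerate(tuples):
        tasks[i % num_tasks].append(t)
    return tasks

def fields_grouping(tuples, field, num_tasks):
    """Hash the grouping field so equal values always reach the same task."""
    tasks = [[] for _ in range(num_tasks)]
    for t in tuples:
        tasks[zlib.crc32(str(t[field]).encode()) % num_tasks].append(t)
    return tasks

events = [{"truckId": 98, "v": 1}, {"truckId": 99, "v": 2}, {"truckId": 98, "v": 3}]
by_truck = fields_grouping(events, "truckId", 4)
# Both truck-98 events land in the same task, in arrival order:
print([t["v"] for ts in by_truck for t in ts if t["truckId"] == 98])   # [1, 3]
```

This is why the GeoFencer bolt uses fields grouping: per-geohash state only works if all tuples for one geohash reach the same task.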
Scalability & Reliability
How to scale a Streaming Analytics System?

(Diagram 1: one Collecting Process runs Collecting Threads 1 … n, which feed a persistent queue; Processing Threads 1 … n consume the events and emit results)

(Diagram 2: multiple Collecting Processes feed multiple persistent queues (Queue 1 … Queue n), which are consumed by multiple Processing Processes (Processing A Process 1/2, Processing B Process 1/2))

(Diagram 3: multiple Collecting Processes feed Processing A and Processing B, each running Threads 1 … n with per-thread queues (Q1e … Qne))
How to make a Streaming Analytics System reliable?

Faults and stragglers are inevitable in large clusters running big data applications.
Streaming applications must recover from them quickly.

(Diagram: if a Collecting Process or a Processing A/B thread fails, the events buffered in the queues between stages must not be lost)
How to deal with "stragglers"

When a consumer goes slow, the options are all unattractive:
• Backpressure → other jobs grind to a halt
• Queue up → run out of memory
• Drop data → no thanks
• Spill to disk → no thanks
How to make a Streaming Analytics System reliable?

Solution 1: use an active/passive system (hot replication)
• Both systems process the full load
• In case of a failure, automatically switch over and use the "passive" system
• Stragglers slow down both the active and the passive system

(Diagram: identical active and passive pipelines, each keeping State in memory and/or on disk)
Solution 2: upstream backup
• Nodes buffer sent messages and replay them to a new node in case of failure
• Stragglers are treated as failures

(Diagram: each stage keeps State (in memory and/or on disk) and a buffer for replay (in memory and/or on disk))
Message Delivery Semantics
At most once [0,1]
• Messages may be lost
• Messages are never redelivered

At least once [1..n]
• Messages are never lost
• But messages may be redelivered (might be OK if the consumer can handle it)

Exactly once [1]
• Messages are never lost
• Messages are never redelivered
• Perfect message delivery
• Incurs higher latency for transactional semantics
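With at-least-once delivery, an idempotent consumer that deduplicates on a message id gives effectively-once results without paying the latency of transactional exactly-once (illustrative sketch; the class and field names are hypothetical):

```python
class DedupingConsumer:
    """Make at-least-once delivery safe by remembering processed message ids."""

    def __init__(self):
        self._seen = set()
        self.total = 0

    def handle(self, msg_id, amount):
        if msg_id in self._seen:       # redelivery: already processed, skip
            return
        self._seen.add(msg_id)
        self.total += amount           # the actual side effect runs once per id

c = DedupingConsumer()
for msg_id, amount in [("m1", 10), ("m2", 5), ("m1", 10)]:   # m1 is redelivered
    c.handle(msg_id, amount)
print(c.total)   # 15
```

In practice the seen-id set must itself be persisted (or bounded by a time window), otherwise a consumer restart reintroduces duplicates.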
Streaming Analytics in Architecture
"Traditional Architecture" for Big Data

(Diagram: data sources (social, RDBMS, sensor, ERP, logfiles, mobile, machine) → channel → data collection → (analytical) data processing: batch compute over a stage and the raw data (reservoir) → result store / computed information → query engine → data consumers (reports, service, analytic tools, alerting tools); legend: data in motion vs. data at rest)
Streaming Analytics Architecture for Big Data
a.k.a. (Complex) Event Processing

(Diagram: the same data sources → channel → data collection → messaging → (analytical) real-time data processing with stream/event processing → result stores → data consumers; the data stays in motion until it reaches a result store)
Keep raw event data

(Diagram: as before, but the messaging layer additionally feeds an (analytical) batch data processing path that persists the raw data (reservoir) with batch compute)
"Lambda Architecture" for Big Data

(Diagram: data sources → channel → data collection → messaging; a batch layer ((analytical) batch data processing with batch compute over the raw data (reservoir)) and a speed layer ((analytical) real-time data processing with stream/event processing) both fill result stores / computed information, merged by a query engine for the data consumers)
"Kappa Architecture" for Big Data

(Diagram: data sources → data collection → messaging acting as the "raw data reservoir"; a single (analytical) real-time data processing path (stream/event processing) fills the result stores / computed information for the data consumers; there is no separate batch layer)
ComputedInformation
"Unified Architecture" for Big Data

(Diagram: like the Lambda architecture, but the (analytical) batch data processing calculates models of the incoming data, and the real-time data processing applies these prediction models to the stream)
PredictionModels
Summary
Summary
More and more use cases (such as IoT) make Streaming Analytics necessary

Treat events as events! Infrastructures for handling lots of events are available!

Platforms such as Oracle Stream Analytics enable the business to work directly on streaming data (empowering the business analyst) => the user experience of an Excel sheet on streaming data

Platforms such as Apache Storm and Apache Spark Streaming provide a highly scalable and fault-tolerant infrastructure for streaming analytics => Oracle Stream Analytics can use Spark Streaming as the runtime infrastructure

Platforms such as Kafka provide a high-volume event broker infrastructure, a.k.a. an Event Hub
Comparison                  | Oracle Stream Analytics                   | Spark Streaming                      | Apache Storm
Community                   | n.a.                                      | >280 contributors                    | >100 contributors
Language options            | Java, CQL                                 | Java, Scala, Python                  | Java, Clojure, Scala, …
Processing model            | Event streaming                           | Micro-batching                       | Event streaming
Processing DSL              | Yes                                       | Yes                                  | No
Stateful ops                | Yes                                       | Yes                                  | No
Pattern detection           | Yes                                       | No                                   | No
Scalability & reliability   | limited                                   | yes                                  | yes
Distributed RPC             | No                                        | No                                   | Yes
Delivery guarantees         | At least once                             | Exactly once                         | At most once / At least once
Latency                     | sub-second                                | seconds                              | sub-second
"Self-service" for business | Yes                                       | No                                   | No
Platform                    | OEP server, Spark Streaming (YARN, Mesos) | YARN, Mesos, Standalone, DataStax EE | Storm Cluster, YARN