Date posted: 15-Apr-2017
Category: Technology
Uploaded by: gwen-chen-shapira
Real-Time Anomaly Detection: Patterns and Reference Architectures
Gwen Shapira, System Architect
©2014 Cloudera, Inc. All rights reserved.
Overview
• Intro
• Review problem
• Quick overview of key technology
• High-level architecture
• Deep dive into NRT processing
• Completing the puzzle: micro-batch, ingest and batch
Gwen Shapira
• 15 years of moving data
• Formerly consultant, engineer
• System Architect @ Confluent
• Kafka committer
• @gwenshap
There’s a Book on That
Founded by creators of Kafka - @jaykreps, @nehanarkhede, @junrao
We help you gather, transport, organize, and analyze all of your stream data
What we offer
• Confluent Platform
  • Kafka plus critical bug fixes not yet applied in the Apache release
• Kafka ecosystem projects
• Enterprise support
• Training and Professional Services
The Problem
Credit Card Transaction Fraud
Coupon Fraud
Video Game Strategy
Health Insurance Fraud
How Do We React?
• Human brain at tennis:
  • Muscle memory
  • Reaction thought
  • Reflective meditation
Overview of Key Technologies
Kafka
The Basics
• Messages are organized into topics
• Producers push messages
• Consumers pull messages
• Kafka runs as a cluster; nodes are called brokers
Topics, Partitions and Logs
Each partition is a log
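In other words, a partition is just an append-only sequence of records, each addressed by a sequential offset. A minimal sketch of that model (plain Java, illustrative only — not the Kafka API):

```java
import java.util.ArrayList;
import java.util.List;

// A partition modeled as an append-only log: records get sequential
// offsets and are never modified in place.
class PartitionLog {
    private final List<String> records = new ArrayList<>();

    // Append a record; returns the offset it was written at.
    long append(String record) {
        records.add(record);
        return records.size() - 1;
    }

    // Read the record stored at a given offset.
    String read(long offset) {
        return records.get((int) offset);
    }

    // Offset where the next record will land.
    long endOffset() {
        return records.size();
    }
}
```

Reads are just positional lookups, which is why consumers can re-read any offset they still have retained.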
Each Broker has many partitions
[Diagram: three brokers, each holding several of partitions 0-2, with partitions replicated across brokers]
Producers load balance between partitions
[Diagram: a producer client distributing writes across partitions 0-2 on multiple brokers]
Consumers
[Diagram: Kafka cluster with a topic of partitions A-C (each a file); Consumer Group X and Consumer Group Y each read all partitions, tracking offsets X and Y independently]
• Order is retained within a partition, but not across partitions
• Offsets are kept per consumer group
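Because offsets are per consumer group, two groups can read the same partition independently, each at its own position. A toy illustration (plain Java, hypothetical class names):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Each consumer group keeps its own offset into the shared log,
// so groups consume the same data independently.
class GroupOffsets {
    private final List<String> log;                     // shared partition data
    private final Map<String, Integer> offsets = new HashMap<>();

    GroupOffsets(List<String> log) { this.log = log; }

    // Poll the next record for a group, advancing only that group's offset.
    String poll(String group) {
        int pos = offsets.getOrDefault(group, 0);
        if (pos >= log.size()) return null;             // nothing new yet
        offsets.put(group, pos + 1);
        return log.get(pos);
    }
}
```

Group X can be at the head of the log while group Y is still replaying from the beginning; neither affects the other.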
Consumer-Producer Pattern
Keeping Things Simple
• Consume records from a Kafka topic
• Filter, transform, join, look up, aggregate
• Write to another Kafka topic
• https://github.com/confluentinc/examples/tree/master/specific-avro-consumer
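The pattern is deliberately simple: read, apply per-record logic, write. A sketch with in-memory lists standing in for the two topics (the linked example shows the same shape with the real Avro consumer and producer clients):

```java
import java.util.ArrayList;
import java.util.List;

// Consume -> filter/transform -> produce, with lists standing in for topics.
class StreamStep {
    static List<String> process(List<String> inputTopic) {
        List<String> outputTopic = new ArrayList<>();
        for (String record : inputTopic) {
            if (record.isEmpty()) continue;            // filter out empty records
            outputTopic.add(record.toUpperCase());     // per-record transform
        }
        return outputTopic;
    }
}
```

Everything else (joins, lookups, aggregates) slots into the same loop body; the framework around it only changes how records arrive and leave.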
Kafka Makes Streams Easy• Producers partition the data• Consumers load balance partitions• Add / remove consumers any way you want• Will work with any framework (or none!)
Coming Soon to a Kafka Near You
• Kafka Connect - export / import for Kafka - 0.9.0 (it's here!)
• Consumer-producer client - Processor (0.10.0 - April?)
• DSLs:
  • KStream (a bit like Spark) - (0.10.0 - April?)
  • SQL - ???
Kafka Connect - It's a Thing
• Easy to add connectors to Kafka
• Existing connectors:
  • JDBC
  • HDFS
  • MySQL (x2)
  • ElasticSearch (x4)
  • Cassandra
  • S3 (x2)
  • MQTT
  • Twitter
• Kafka connectors:
  • http://www.confluent.io/developers/connectors
  • http://docs.confluent.io/2.0.0/connect/index.html
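A Connect source is driven entirely by configuration — no ingestion code to write. For example, a JDBC source connector is typically configured along these lines (property names follow the Confluent JDBC connector docs; the connection URL, column and topic prefix are placeholders):

```properties
name=jdbc-transactions-source
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
connection.url=jdbc:mysql://db-host:3306/payments
mode=incrementing
incrementing.column.name=txn_id
topic.prefix=mysql-
```

With `mode=incrementing`, the connector polls for rows whose `txn_id` is higher than the last one it saw and publishes them to `mysql-<table>` topics.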
• KStreams:
  • https://github.com/gwenshap/kafka-examples/blob/master/KafkaStreamsAvg
Spark Streaming
Spark Example
val conf = new SparkConf().setMaster("local[2]")
val sc = new SparkContext(conf)
val lines = sc.textFile(path, 2)
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.collect().foreach(println)
Spark Streaming Example
val conf = new SparkConf().setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.print()
ssc.start()
Spark Streaming
[Diagram: per-batch pipeline - before the first batch, each Source gets a Receiver; in each batch interval the Receiver emits an RDD into a DStream, and a single pass applies Filter, Count, Print]
[Diagram: stateful variant - the same per-batch pipeline, but each batch's Filter/Count output is folded into a Stateful RDD carried over from the previous batch]
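The stateful variant carries an aggregate forward from batch to batch (in Spark, via `updateStateByKey`). The mechanics can be sketched without Spark: each micro-batch produces a new state from the previous state plus that batch's records (plain Java, illustrative only):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Each micro-batch folds its records into a copy of the previous state,
// mimicking how a stateful DStream produces Stateful RDD 1, 2, ...
class StatefulBatches {
    static Map<String, Integer> nextState(Map<String, Integer> prev, List<String> batch) {
        Map<String, Integer> next = new HashMap<>(prev); // state carried forward
        for (String word : batch) {
            next.merge(word, 1, Integer::sum);           // running count per key
        }
        return next;
    }
}
```

Note that each batch yields a new state rather than mutating the old one — the same immutability that lets Spark keep lineage for recovery.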
High Level Architecture
Real-Time Event Processing Approach
[Diagram: Clients -> Flume agents -> Kafka -> Spark Streaming on Hadoop Cluster I, with a local cache and HBase / memory for fetching & updating profiles; HDFS event sink and SolR sink feed Hadoop Cluster II (storage + processing: HDFS, Hive/Impala, MapReduce, Spark, Search) for automated & manual analytical adjustments, pattern detection and batch-time adjustments; a web app supports review of NRT changes and counters and adjusts NRT statistics]
[Diagram: Kafka-centric variant - Clients -> Kafka topics (transactions, profile updates, model updates) -> KStream processors with local stores, running on YARN / Mesos; decisions are written back to Kafka; connectors export to HDFS, NoSQL, SolR and the DWH (via its redo log); an analytics layer performs automated & manual analytical adjustments and pattern detection, feeding batch-time adjustments back through the update topics; a web app supports review of NRT changes and counters]
NRT Processing
Focus on NRT First
[Diagram: the same end-to-end architecture, highlighting the NRT path: Clients -> Kafka -> stream processor with a local cache, fetching & updating profiles in HBase / memory, ahead of the batch-time components]
Streaming Architecture – NRT Event Processing
[Diagram: Kafka (initial events topic) -> Kafka consumer -> event processing logic, backed by local memory and an HBase client talking to HBase -> Kafka producer -> Kafka (answer topic)]
Able to respond within tens of milliseconds
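Those response times depend on the profile lookup hitting local memory most of the time, with HBase consulted only on a cache miss. A sketch of that lookup path, with a map standing in for the HBase client (class and field names are hypothetical):

```java
import java.util.HashMap;
import java.util.Map;

// Local cache in front of a slower remote store (a map stands in for HBase).
class ProfileLookup {
    private final Map<String, String> localCache = new HashMap<>();
    private final Map<String, String> hbase;   // stand-in for the HBase client
    int remoteReads = 0;                       // exposed for the example

    ProfileLookup(Map<String, String> hbase) { this.hbase = hbase; }

    String get(String key) {
        String v = localCache.get(key);
        if (v != null) return v;               // fast path: local memory
        v = hbase.get(key);                    // slow path: remote fetch
        remoteReads++;
        localCache.put(key, v);                // cache for subsequent events
        return v;
    }
}
```

Repeated events for the same card hit the fast path, so the remote store is only touched once per profile.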
Partitioned NRT Event Processing
[Diagram: producers use a custom partitioner to route events to topic partitions A-C; each partition is handled by its own pipeline of Kafka consumer -> event processing logic (local cache, HBase client -> HBase) -> Kafka producer -> Kafka answer topic]
A custom partitioner enables better use of local memory
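The idea is to route all events for the same entity (say, a card number) to the same partition, so that partition's processor keeps the relevant profiles hot in its local cache. Kafka's default partitioner hashes the key with murmur2; the sketch below uses `hashCode()` purely for illustration:

```java
// Route a key to a partition deterministically: the same card number
// always lands on the same partition, so its state stays in one
// processor's local cache. (Kafka's default partitioner uses murmur2;
// hashCode() here is only illustrative.)
class CardPartitioner {
    static int partitionFor(String cardNumber, int numPartitions) {
        // Mask off the sign bit so the modulo result is non-negative.
        return (cardNumber.hashCode() & 0x7fffffff) % numPartitions;
    }
}
```

Determinism is the whole point: with random load balancing every processor would need every profile, while keyed routing shards the working set across processors.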
Questions?
http://confluent.io
@confluentInc
@gwenshap