Date post: | 21-Apr-2017 |
Category: |
Data & Analytics |
Upload: | gwen-chen-shapira |
Real Time Fraud Detection: Patterns and Reference Architectures
Ted Malaska // PSA
Gwen Shapira // Software Engineer
Overview
• Intro
• Review Problem
• Quick overview of key technology
• High level architecture
• Deep Dive into NRT Processing
• Completing the Puzzle: Micro-batch, Ingest and Batch
©2014 Cloudera, Inc. All rights reserved.
Gwen Shapira
• 15 years of moving data
• Formerly consultant
• Now Cloudera Engineer:
  – Sqoop Committer
  – Kafka
  – Flume
• @gwenshap
Hello
• Ted Malaska (PSA at Cloudera)
• Hadoop for ~5 years
• Contributed to
  – HDFS, MapReduce, Yarn, HBase, Spark, Avro,
  – Kite, Pig, Navigator, Cloudera Manager, Flume, Kafka, Sqoop, Accumulo
  – And working on a Sentry Patch
• Co-author of O’Reilly Hadoop Application Architectures
• Worked with about 70 companies in 8 countries
• Marvel Fan Boy
• Runner
The Problem
Credit Card Transaction Fraud
Ikea Meat Balls
Coupon Fraud
Video Game Strategy
Health Insurance Fraud
Review of the Problem
• Typical Atomic Card Fraud Detection
• Ikea Meat Ball
• Multi Coupons Combinations
• OP or Negative Video Games Strategies
• Ad Serving
• Health Insurance Fraud
• Kid Coming Home From School
How do we React
• Human Brain at Tennis
  – Muscle Memory
  – Reaction Thought
  – Reflective Meditation
Overview of Key Technologies
Kafka
The Basics
• Messages are organized into topics
• Producers push messages
• Consumers pull messages
• Kafka runs in a cluster. Nodes are called brokers
Topics, Partitions and Logs
Each partition is a log
Each Broker has many partitions
[Diagram: Partition 0, Partition 1 and Partition 2 spread across the brokers, each partition appearing on more than one broker]
Producers load balance between partitions
[Diagram: a client producing to Partition 0, Partition 1 and Partition 2, which are spread across the brokers]
Consumers
[Diagram: a Kafka cluster topic with Partition A (File), Partition B (File) and Partition C (File), read by the consumers of Consumer Group X and Consumer Group Y; every partition carries a separate Offset X and Offset Y]
• Order is retained within a partition, but not across partitions
• Offsets are kept per consumer group
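The log-and-offset idea on the Kafka slides can be sketched with a toy in-memory model. This is a minimal illustration only: `ToyTopic`, `append` and `poll` are hypothetical names and are not part of the Kafka client API. It shows each partition as an append-only log, and each consumer group keeping its own read position per partition.

```scala
import scala.collection.mutable

// Toy in-memory model for illustration only; this is NOT the Kafka client API.
final class ToyTopic(numPartitions: Int) {
  // Each partition is an append-only log.
  private val logs = Vector.fill(numPartitions)(mutable.ArrayBuffer.empty[String])
  // Offsets are tracked per (consumer group, partition).
  private val offsets = mutable.Map.empty[(String, Int), Int].withDefaultValue(0)

  // Producers push: append a message to the end of one partition's log.
  def append(partition: Int, message: String): Unit = {
    logs(partition) += message
    ()
  }

  // Consumers pull: return everything after the group's offset, then advance it.
  def poll(group: String, partition: Int): List[String] = {
    val from = offsets((group, partition))
    val batch = logs(partition).drop(from).toList
    offsets((group, partition)) = from + batch.size
    batch
  }
}

val topic = new ToyTopic(3)
topic.append(0, "swipe-1")
topic.append(0, "swipe-2")
topic.append(1, "swipe-3")

val firstX  = topic.poll("X", 0) // group X reads partition 0 in append order
val secondX = topic.poll("X", 0) // nothing new for group X
val firstY  = topic.poll("Y", 0) // group Y has its own offset, so it sees both messages
```

Because the offset belongs to the group rather than to the log, a second group can replay the same partition without affecting the first.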
Flume
Short Intro to Flume
Flume Agent
• Sources: Twitter, logs, JMS, webserver, Kafka
• Interceptors: Mask, re-format, validate…
• Selectors: DR, critical
• Channels: Memory, file, Kafka
• Sinks: HDFS, HBase, Solr
Flume and/or Kafka
[Diagram: Upstream → Flume Source → Interceptor → Selector → Flume Channel → Flume Sink → Downstream; three of the Flume stages are labeled "Can Be Kafka"]
Interceptors
• Mask fields
• Validate information against external source
• Extract fields
• Modify data format
• Filter or split events
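The interceptor roles listed above can be sketched as small functions chained together. This is a rough illustration under stated assumptions: the `Event` case class, the `Interceptor` type alias and the record layout "merchant,card,amount" are all hypothetical. Flume's real interceptors implement a Java interface that this sketch does not use.

```scala
// Hypothetical event type and interceptor signature, for illustration only.
final case class Event(headers: Map[String, String], body: String)
type Interceptor = Event => Option[Event] // None means "drop the event"

// Mask fields: hide all but the last four digits of a 16-digit card number.
val maskCard: Interceptor = e =>
  Some(e.copy(body = e.body.replaceAll("\\d{12}(\\d{4})", "************$1")))

// Filter events: drop events with an empty body.
val dropEmpty: Interceptor = e => if (e.body.trim.isEmpty) None else Some(e)

// Extract fields: copy the first comma-separated field into a header.
val extractMerchant: Interceptor = e =>
  Some(e.copy(headers = e.headers + ("merchant" -> e.body.takeWhile(_ != ','))))

// Run interceptors in order, stopping as soon as one drops the event.
def chain(steps: Interceptor*): Interceptor = e =>
  steps.foldLeft(Option(e))((acc, step) => acc.flatMap(step))

val pipeline = chain(dropEmpty, extractMerchant, maskCard)
val out      = pipeline(Event(Map.empty, "ikea,4111111111111111,12.50"))
val dropped  = pipeline(Event(Map.empty, "   "))
```

Modeling a dropped event as `None` keeps each step independent, so steps can be reordered or removed without touching the others.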
Spark Streaming
Spark Streaming Example
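import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Two local threads: one for the socket receiver, one for processing.
// The application name is arbitrary.
val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(1)) // 1-second micro-batches
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.print()
ssc.start()            // start receiving and processing
ssc.awaitTermination() // block until the job is stopped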
Spark Streaming Example
import org.apache.spark.{SparkConf, SparkContext}

// Batch version of the same word count: SparkContext instead of StreamingContext.
val conf = new SparkConf().setMaster("local[2]").setAppName("WordCount")
val sc = new SparkContext(conf)
val lines = sc.textFile(path, 2) // path: input location, defined elsewhere
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.collect().foreach(println) // RDDs have no print(); collect to the driver first
Spark Streaming
[Diagram: a DStream across three intervals (Pre-first Batch, First Batch, Second Batch). In every interval, Source → Receiver → RDD. Each completed RDD is then processed in a single pass: Filter → Count → Print.]
Spark Streaming
[Diagram: the same DStream flow with state. Each single pass (Filter → Count) also produces a stateful RDD (Stateful RDD 1 after the first batch, Stateful RDD 2 after the second), and the next batch reads the previous stateful RDD.]
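The stateful flow can be sketched without Spark by folding each batch into the state left by the previous one. This is a plain-collections illustration with made-up data: the `batches` list stands in for the per-interval RDDs and `updateState` is a hypothetical helper. In Spark Streaming itself, an operation such as `updateStateByKey` plays this role.

```scala
// Each inner list stands in for the RDD of one batch interval (made-up data).
val batches: List[List[String]] = List(
  List("card-1", "card-2", "card-1"),
  List("card-1"),
  List("card-2", "card-3")
)

// Merge one batch into the previous state: a running count per key.
def updateState(state: Map[String, Int], batch: List[String]): Map[String, Int] =
  batch.foldLeft(state) { (acc, key) => acc.updated(key, acc.getOrElse(key, 0) + 1) }

// scanLeft keeps every intermediate state: the empty start plus one per batch.
val states = batches.scanLeft(Map.empty[String, Int])(updateState)

val afterFirstBatch = states(1) // state carried into the second batch
val finalState      = states.last
```

Only the previous state and the current batch are needed at each step, which is why the state can be carried forward as a single RDD rather than by rescanning old batches.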
Spark Streaming and HBase
[Diagram: the Driver holds the Configs and ships them to the Executor on each Worker Node. Each Executor keeps the Configs and one HConnection in static space, shared by the Tasks running in that Executor.]
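The "static space" idea can be sketched with a Scala `object`, which is initialized once per JVM and therefore once per executor. This is a minimal sketch under stated assumptions: `FakeConnection`, `ConnectionHolder` and the `"zk-host"` setting are hypothetical stand-ins, not the HBase client API.

```scala
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical stand-in for an expensive client such as an HBase connection.
final class FakeConnection(val quorum: String) {
  def put(row: String): String = s"$quorum <- $row"
}

// A Scala object lives once per JVM, so every task in an executor shares it.
object ConnectionHolder {
  val created = new AtomicInteger(0) // counts how many connections were built
  private var conn: Option[FakeConnection] = None

  // The first task to call this builds the connection; later tasks reuse it.
  def get(quorum: String): FakeConnection = synchronized {
    conn.getOrElse {
      created.incrementAndGet()
      val c = new FakeConnection(quorum)
      conn = Some(c)
      c
    }
  }
}

// Simulate four tasks on one executor, each writing one row.
val results = (1 to 4).map { task =>
  val c = ConnectionHolder.get("zk-host") // config value shipped from the driver
  c.put(s"row-$task")
}
```

The configuration travels with the task, but the connection itself is never serialized: it is created lazily on the executor and reused.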
High Level Architecture
Real-Time Event Processing Approach
[Diagram: overall architecture.
Components shown: Clients ("Swipe here!"), Web App, Flume Agents with a Local Cache, Kafka, HBase / Memory and Spark Streaming in Hadoop Cluster I; an HDFS EventSink and a SolR Sink; Storage (HDFS), Processing (Hive/Impala, Map/Reduce, Spark) and Search (SolR) in Hadoop Cluster II.
Labels on the flows: Fetching & Updating Profiles; Adjusting NRT Stats; Batch Time Adjustments; Automated & Manual Review of NRT Changes and Counters; Automated & Manual Analytical Adjustments and Pattern detection.]
NRT Processing
Focus on NRT First
[Architecture diagram repeated, highlighting the part labeled "NRT Event Processing with Context"]
Streaming Architecture – NRT Event Processing
[Diagram: Kafka Initial Events Topic → Kafka Consumer → Flume Source → Flume Interceptor (Event Processing Logic, Local Memory, HBase Client talking to HBase) → Kafka Producer → Kafka Answer Topic]
• Able to respond within tens of milliseconds
Partitioned NRT Event Processing
[Diagram: the same NRT flow, with several producers that each use a partitioner to write to Partition A, Partition B and Partition C of the topic]
• Custom Partitioner
• Better use of local memory
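One way to read "custom partitioner, better use of local memory" is that routing by a stable key sends every event for that key to the same processor, so its profile only has to be cached in one place. The sketch below assumes the key is a card id. `Swipe` and `partitionFor` are hypothetical names, and the hashing rule is an illustration, not Kafka's partitioner interface.

```scala
// Hypothetical event: only the card id matters for routing.
final case class Swipe(cardId: String, amount: Double)

// Route by card id so one processor sees every swipe for a given card.
def partitionFor(cardId: String, numPartitions: Int): Int =
  Math.floorMod(cardId.hashCode, numPartitions)

val numPartitions = 3
val swipes = List(Swipe("card-1", 10.0), Swipe("card-2", 5.0), Swipe("card-1", 99.0))

// Each partition's processor keeps a local running total per card.
val localTotals: Map[Int, Map[String, Double]] =
  swipes.groupBy(s => partitionFor(s.cardId, numPartitions)).map { case (p, events) =>
    p -> events.groupBy(_.cardId).map { case (id, es) => id -> es.map(_.amount).sum }
  }

// The running total for card-1 lives in exactly one processor's memory.
val ownersOfCard1 = localTotals.collect { case (p, totals) if totals.contains("card-1") => p }
```

With random or round-robin routing, the same card could land on any processor, and each would need its own copy of the profile or a remote lookup.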
Completing the Puzzle
Micro Batching
[Architecture diagram repeated, highlighting the parts labeled "Micro Batching"]
Complex Topologies
[Diagram: two ways to feed DAG topologies in Spark Streaming from the Kafka Initial Events Topic]
Kafka Direct Connection (1.3)
• Manages offsets
• Stores offsets in the RDD
• No longer needs HDFS for initial RDD checkpointing
Kafka Receivers (1.2)
• Lets Kafka manage offsets
• Uses HDFS for initial RDD recovery
MicroBatch Bad-Input Handling
[Diagram: four Kafka topics, each drawn as a log with offsets 0 to 13: incoming events topic, bad events topic, resolved events topic and results topic, connected through the DAG topologies]
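One plausible reading of the four topics is that records a micro-batch cannot process are parked on a separate topic instead of failing the batch, and are replayed once resolved. The sketch below shows only the split step, using plain collections. The record format "cardId,amount" and the `parse` helper are assumptions made for illustration, and the lists stand in for the topics.

```scala
import scala.util.Try

// Hypothetical record format "cardId,amount"; anything else counts as bad input.
def parse(raw: String): Either[String, (String, Double)] = raw.split(",") match {
  case Array(id, amt) if id.nonEmpty =>
    Try(amt.trim.toDouble).toOption.map(a => (id, a)).toRight(raw)
  case _ => Left(raw)
}

// Stand-in for one micro-batch read from the incoming events topic.
val incoming = List("card-1,12.50", "garbage", "card-2,abc", "card-3,7")

// Good records continue toward the results topic; bad ones are kept verbatim
// for the bad events topic so they can be fixed and replayed later.
val (bad, good) = incoming.map(parse).partitionMap(identity)
```

Keeping the raw text of a bad record, rather than an error message alone, is what makes a later "resolved events" replay possible.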
Ingestion
[Architecture diagram repeated, highlighting the parts labeled "Ingestion"]
Ingestion
[Diagram: a Kafka cluster topic with Partition A, Partition B and Partition C is read by three groups of Flume sinks: Flume HDFS Sinks writing to HDFS, Flume SolR Sinks writing to SolR, and Flume HBase Sinks writing to HBase]
Reflective Thoughts
[Architecture diagram repeated, highlighting the part labeled "Research and Searching"]