Date posted: 15-Apr-2017
Category: Technology
Uploaded by: gwen-chen-shapira
Real-Time Anomaly Detection: Patterns and Reference Architectures
Gwen Shapira, System Architect
©2014 Cloudera, Inc. All rights reserved.
Overview
• Intro
• Review problem
• Quick overview of key technology
• High-level architecture
• Deep dive into NRT processing
• Completing the puzzle: micro-batch, ingest and batch
Gwen Shapira
• 15 years of moving data
• Formerly consultant, engineer
• System Architect @ Confluent
• Kafka committer
• @gwenshap
There’s a Book on That
Founded by creators of Kafka - @jaykreps, @nehanarkhede, @junrao
We help you gather, transport, organize, and analyze all of your stream data
What we offer
• Confluent Platform
  • Kafka plus critical bug fixes not yet applied in the Apache release
• Kafka ecosystem projects
• Enterprise support
• Training and Professional Services
The Problem
Credit Card Transaction Fraud
Coupon Fraud
Video Game Strategy
Health Insurance Fraud
How Do We React?
• Human brain at tennis:
  • Muscle memory
  • Reaction thought
  • Reflective meditation
Overview of Key Technologies
Kafka
The Basics
• Messages are organized into topics
• Producers push messages
• Consumers pull messages
• Kafka runs as a cluster; nodes are called brokers
Topics, Partitions and Logs
Each partition is a log
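In other words, a partition is just an append-only sequence of records, each addressed by a sequential offset. A minimal sketch of that model (plain Java, illustrative only — not the Kafka API):

```java
import java.util.ArrayList;
import java.util.List;

// A partition modeled as an append-only log: records get sequential
// offsets and are never modified in place.
class PartitionLog {
    private final List<String> records = new ArrayList<>();

    // Append a record; returns the offset it was written at.
    long append(String record) {
        records.add(record);
        return records.size() - 1;
    }

    // Read the record stored at a given offset.
    String read(long offset) {
        return records.get((int) offset);
    }

    // Offset where the next record will land.
    long endOffset() {
        return records.size();
    }
}
```

Reads are just positional lookups, which is why consumers can re-read any offset they still have retained.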
Each Broker has many partitions
[Diagram: three brokers, each holding several of partitions 0-2, with partitions replicated across brokers]
Producers load balance between partitions
[Diagram: a producer client distributing writes across partitions 0-2 on multiple brokers]
Consumers
[Diagram: Kafka cluster with a topic of partitions A-C (each a file); Consumer Group X and Consumer Group Y each read all partitions, tracking offsets X and Y independently]
• Order is retained within a partition, but not across partitions
• Offsets are kept per consumer group
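Because offsets are per consumer group, two groups can read the same partition independently, each at its own position. A toy illustration (plain Java, hypothetical class names):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Each consumer group keeps its own offset into the shared log,
// so groups consume the same data independently.
class GroupOffsets {
    private final List<String> log;                     // shared partition data
    private final Map<String, Integer> offsets = new HashMap<>();

    GroupOffsets(List<String> log) { this.log = log; }

    // Poll the next record for a group, advancing only that group's offset.
    String poll(String group) {
        int pos = offsets.getOrDefault(group, 0);
        if (pos >= log.size()) return null;             // nothing new yet
        offsets.put(group, pos + 1);
        return log.get(pos);
    }
}
```

Group X can be at the head of the log while group Y is still replaying from the beginning; neither affects the other.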
Consumer-Producer Pattern
Keeping Things Simple
• Consume records from a Kafka topic
• Filter, transform, join, look up, aggregate
• Write to another Kafka topic
• https://github.com/confluentinc/examples/tree/master/specific-avro-consumer
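The pattern is deliberately simple: read, apply per-record logic, write. A sketch with in-memory lists standing in for the two topics (the linked example shows the same shape with the real Avro consumer and producer clients):

```java
import java.util.ArrayList;
import java.util.List;

// Consume -> filter/transform -> produce, with lists standing in for topics.
class StreamStep {
    static List<String> process(List<String> inputTopic) {
        List<String> outputTopic = new ArrayList<>();
        for (String record : inputTopic) {
            if (record.isEmpty()) continue;            // filter out empty records
            outputTopic.add(record.toUpperCase());     // per-record transform
        }
        return outputTopic;
    }
}
```

Everything else (joins, lookups, aggregates) slots into the same loop body; the framework around it only changes how records arrive and leave.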
Kafka Makes Streams Easy• Producers partition the data• Consumers load balance partitions• Add / remove consumers any way you want• Will work with any framework (or none!)
Coming Soon to a Kafka Near You
• Kafka Connect - export / import for Kafka - 0.9.0 (it's here!)
• Consumer-producer client - Processor (0.10.0 - April?)
• DSLs:
  • KStream (a bit like Spark) - (0.10.0 - April?)
  • SQL - ???
Kafka Connect - It's a Thing
• Easy to add connectors to Kafka
• Existing connectors:
  • JDBC
  • HDFS
  • MySQL (x2)
  • ElasticSearch (x4)
  • Cassandra
  • S3 (x2)
  • MQTT
  • Twitter
• Kafka connectors:
  • http://www.confluent.io/developers/connectors
  • http://docs.confluent.io/2.0.0/connect/index.html
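A Connect source is driven entirely by configuration — no ingestion code to write. For example, a JDBC source connector is typically configured along these lines (property names follow the Confluent JDBC connector docs; the connection URL, column and topic prefix are placeholders):

```properties
name=jdbc-transactions-source
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
connection.url=jdbc:mysql://db-host:3306/payments
mode=incrementing
incrementing.column.name=txn_id
topic.prefix=mysql-
```

With `mode=incrementing`, the connector polls for rows whose `txn_id` is higher than the last one it saw and publishes them to `mysql-<table>` topics.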
• KStreams:
  • https://github.com/gwenshap/kafka-examples/blob/master/KafkaStreamsAvg
Spark Streaming
Spark Example
val conf = new SparkConf().setMaster("local[2]")
val sc = new SparkContext(conf)
val lines = sc.textFile(path, 2)
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.collect().foreach(println)
Spark Streaming Example
val conf = new SparkConf().setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.print()
ssc.start()
Spark Streaming
[Diagram: per-batch pipeline - before the first batch, each Source gets a Receiver; in each batch interval the Receiver emits an RDD into a DStream, and a single pass applies Filter, Count, Print]
[Diagram: stateful variant - the same per-batch pipeline, but each batch's Filter/Count output is folded into a Stateful RDD carried over from the previous batch]
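The stateful variant carries an aggregate forward from batch to batch (in Spark, via `updateStateByKey`). The mechanics can be sketched without Spark: each micro-batch produces a new state from the previous state plus that batch's records (plain Java, illustrative only):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Each micro-batch folds its records into a copy of the previous state,
// mimicking how a stateful DStream produces Stateful RDD 1, 2, ...
class StatefulBatches {
    static Map<String, Integer> nextState(Map<String, Integer> prev, List<String> batch) {
        Map<String, Integer> next = new HashMap<>(prev); // state carried forward
        for (String word : batch) {
            next.merge(word, 1, Integer::sum);           // running count per key
        }
        return next;
    }
}
```

Note that each batch yields a new state rather than mutating the old one — the same immutability that lets Spark keep lineage for recovery.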
High Level Architecture
Real-Time Event Processing Approach
[Diagram: Clients -> Flume agents -> Kafka -> Spark Streaming on Hadoop Cluster I, with a local cache and HBase / memory for fetching & updating profiles; HDFS event sink and SolR sink feed Hadoop Cluster II (storage + processing: HDFS, Hive/Impala, MapReduce, Spark, Search) for automated & manual analytical adjustments, pattern detection and batch-time adjustments; a web app supports review of NRT changes and counters and adjusts NRT statistics]
[Diagram: Kafka-centric variant - Clients -> Kafka topics (transactions, profile updates, model updates) -> KStream processors with local stores, running on YARN / Mesos; decisions are written back to Kafka; connectors export to HDFS, NoSQL, SolR and the DWH (via its redo log); an analytics layer performs automated & manual analytical adjustments and pattern detection, feeding batch-time adjustments back through the update topics; a web app supports review of NRT changes and counters]
NRT Processing
Focus on NRT First
[Diagram: the same end-to-end architecture, highlighting the NRT path: Clients -> Kafka -> stream processor with a local cache, fetching & updating profiles in HBase / memory, ahead of the batch-time components]
Streaming Architecture – NRT Event Processing
[Diagram: Kafka (initial events topic) -> Kafka consumer -> event processing logic, backed by local memory and an HBase client talking to HBase -> Kafka producer -> Kafka (answer topic)]
Able to respond within tens of milliseconds
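Those response times depend on the profile lookup hitting local memory most of the time, with HBase consulted only on a cache miss. A sketch of that lookup path, with a map standing in for the HBase client (class and field names are hypothetical):

```java
import java.util.HashMap;
import java.util.Map;

// Local cache in front of a slower remote store (a map stands in for HBase).
class ProfileLookup {
    private final Map<String, String> localCache = new HashMap<>();
    private final Map<String, String> hbase;   // stand-in for the HBase client
    int remoteReads = 0;                       // exposed for the example

    ProfileLookup(Map<String, String> hbase) { this.hbase = hbase; }

    String get(String key) {
        String v = localCache.get(key);
        if (v != null) return v;               // fast path: local memory
        v = hbase.get(key);                    // slow path: remote fetch
        remoteReads++;
        localCache.put(key, v);                // cache for subsequent events
        return v;
    }
}
```

Repeated events for the same card hit the fast path, so the remote store is only touched once per profile.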
Partitioned NRT Event Processing
[Diagram: producers use a custom partitioner to route events to topic partitions A-C; each partition is handled by its own pipeline of Kafka consumer -> event processing logic (local cache, HBase client -> HBase) -> Kafka producer -> Kafka answer topic]
A custom partitioner enables better use of local memory
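The idea is to route all events for the same entity (say, a card number) to the same partition, so that partition's processor keeps the relevant profiles hot in its local cache. Kafka's default partitioner hashes the key with murmur2; the sketch below uses `hashCode()` purely for illustration:

```java
// Route a key to a partition deterministically: the same card number
// always lands on the same partition, so its state stays in one
// processor's local cache. (Kafka's default partitioner uses murmur2;
// hashCode() here is only illustrative.)
class CardPartitioner {
    static int partitionFor(String cardNumber, int numPartitions) {
        // Mask off the sign bit so the modulo result is non-negative.
        return (cardNumber.hashCode() & 0x7fffffff) % numPartitions;
    }
}
```

Determinism is the whole point: with random load balancing every processor would need every profile, while keyed routing shards the working set across processors.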
Questions?
http://confluent.io
@confluentInc
@gwenshap