Date posted: 13-Nov-2014
Category: Technology
Uploaded by: gurinderg
Big Data Analytics: Can it deliver speed and accuracy?
Risk & Compliance Engineering, PayPal
Gurinder S. Grewal
This deck contains generic architecture information and does not reflect the exact details of current or planned systems.
June 2013
about PayPal
• 123MM active users
• 190 markets, 25 currencies
• $300,000 in payments processed per minute
• 2B+ events/day
• 12 TB of new data added per day
• 500K+ real-time queries per second
• < 100ms average response time
we are talking about a lot of data… big data!
what is big data?
• transactions, interactions, observations
• petabytes of data
• diverse analytics
• variety of data structures
• large number of characteristics
• large map/reduce clusters (Hadoop)
• Teradata
Growing complexity and expectations
Emerging technologies are opening up possibilities for sophisticated analytics.
Data infrastructure is growing, and so are the expectations: make decisions fast and with higher accuracy!
[Chart: fraud sophistication and data complexity both grow over time, evolving from simple rules and black/white lists, to linear models with aggregated variables, location/time analysis, inline history analysis, consistency checks, and network analysis.]
Decisions must be quick
• A gang of cyber-criminals stole $45 million in a matter of hours
• More than 36,000 transactions were made worldwide, and about $40 million was stolen in 6 hours
Source: http://www.huffingtonpost.com/2013/05/09/atm-fraud_n_3248331.html
[Chart: business value and fraud loss plotted against the time taken to make a decision: prevention delivers high business value with low fraud loss, fast detection somewhat less, and slow detection ends in high fraud loss.]
Decisions must be accurate
• Credit card used from three distant locations within a short time (11:01 AM, 11:05 AM, 11:06 AM)
Result based on real-time analysis: block the card? Undecided.
• According to past purchasing behavior:
• Card holder lives in the US: wife paid a bill online from the home PC
• Card holder's kid studies in Europe: used the card to purchase books
• Card holder travels to Japan: paid for lunch
Result based on historical analysis: it's legitimate usage
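The scenario above can be sketched in code. This is a toy illustration, not PayPal's actual logic: the speed threshold, the coordinate-based "known locations" profile, and the rule structure are all my own assumptions. A real-time rule flags "impossible travel" between consecutive card uses, and a historical profile of the cardholder's known locations can override the block.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 \
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

# Roughly airliner speed; anything faster implies "impossible travel".
MAX_SPEED_KMH = 900

def realtime_flag(tx_a, tx_b):
    """Real-time rule: flag two card uses whose implied travel speed
    between locations is physically impossible."""
    dist = haversine_km(tx_a["lat"], tx_a["lon"], tx_b["lat"], tx_b["lon"])
    hours = (tx_b["ts"] - tx_a["ts"]) / 3600.0  # timestamps in seconds
    return hours > 0 and dist / hours > MAX_SPEED_KMH

def decide(transactions, known_locations):
    """Real-time analysis says 'block'; historical analysis (every
    location matches the cardholder's known footprint) can override."""
    suspicious = any(
        realtime_flag(a, b) for a, b in zip(transactions, transactions[1:])
    )
    if not suspicious:
        return "allow"
    if all((t["lat"], t["lon"]) in known_locations for t in transactions):
        return "allow"  # legit: home PC, kid in Europe, trip to Japan
    return "block"
```

With an empty historical profile the burst of distant uses is blocked; with the family's known locations on file, the same burst is allowed.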
Do we have conflicting requirements?
speed
• analyze data incoming at high velocity in a split second
• consume data in a timely manner to make decisions
accuracy
• utilize powerful analytics techniques (text mining, predictive analysis)
• process a large variety and volume of data (details)
cost
• can't spend a dollar to save a penny: pick the right tool for the right job
Tiered Big Data Strategy
effective decision = fn(accuracy, speed, cost)
[Diagram: data tiered by age, from seconds ("data in motion") to hours and years ("data in use"); cost and speed trade off against data volume and accuracy across the tiers.]
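The "effective decision = fn(accuracy, speed, cost)" idea can be made concrete as a weighted score over tiers. A minimal sketch; the tier scores and weights below are invented for illustration, not measured figures:

```python
# Hypothetical tier characteristics on a 0-1 scale (my own assumptions).
# "cost" here means cheapness: higher = cheaper per decision.
TIERS = {
    "realtime":      {"accuracy": 0.6, "speed": 1.0, "cost": 0.9},
    "near-realtime": {"accuracy": 0.8, "speed": 0.7, "cost": 0.7},
    "offline":       {"accuracy": 1.0, "speed": 0.1, "cost": 0.5},
}

def effective_decision(weights):
    """Pick the tier maximizing a weighted sum of accuracy, speed, cost."""
    def score(tier):
        return sum(weights[k] * TIERS[tier][k] for k in weights)
    return max(TIERS, key=score)
```

A latency-critical use case (high weight on speed) lands in the realtime tier; an accuracy-critical one lands offline, matching the tiered strategy above.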
Big Data - Computation Strategy
Realtime (in-flow processing)
• fast, very stringent availability and performance SLAs
• computations are simple and eventually accurate
• computations are transient, short-lived (user sessions)
Near real-time (complex event processing)
• event-driven, incremental processing
• high efficiency and scalability
• data for short time windows (hours)
Offline (map-reduce, batch)
• optimized for throughput
• computations are slow and accurate
• data captured as events for historical analysis
Offline variables and online variables flow between the tiers.
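The online/offline variable split can be illustrated with a trivially small example (names and structure are mine): an online variable is maintained incrementally, one cheap update per event, while the corresponding offline variable is recomputed in batch over the full history. Both converge to the same value, which is what "eventually accurate" means here.

```python
class OnlineMean:
    """Online variable: incrementally maintained, O(1) work per event."""
    def __init__(self):
        self.n = 0
        self.total = 0.0

    def update(self, x):
        self.n += 1
        self.total += x

    @property
    def value(self):
        return self.total / self.n if self.n else 0.0

def offline_mean(events):
    """Offline variable: recomputed in batch over the full event history."""
    return sum(events) / len(events) if events else 0.0
```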
Big Data in Use - Offline Ecosystem (Hadoop Technology Stack)
• Data storage: HDFS, HBase
• Data processing: Map/Reduce framework
• Data integration (ETL): Flume, Sqoop
• Programming languages: Pig, Hive QL
• Scheduling, coordination: ZooKeeper, Oozie
• UI framework/SDK: Hue, Hue SDK
• Data sources: structured data (MPP DW, RDBMS), unstructured data
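To make the offline tier concrete, here is a toy map/reduce in plain Python, mimicking in miniature what the Hadoop stack above does at scale: compute an aggregated variable (total spend per account) via a map, shuffle, and reduce phase. The record shape is an assumption for illustration.

```python
from collections import defaultdict
from itertools import chain

def map_phase(record):
    """Mapper: emit (key, value) pairs from one input record."""
    account, amount = record
    yield account, amount

def reduce_phase(key, values):
    """Reducer: aggregate all values shuffled to one key."""
    return key, sum(values)

def run_mapreduce(records):
    # shuffle: group mapper output by key before reducing
    grouped = defaultdict(list)
    for key, value in chain.from_iterable(map(map_phase, records)):
        grouped[key].append(value)
    return dict(reduce_phase(k, vs) for k, vs in grouped.items())
```

On a real cluster the mappers and reducers run in parallel across machines and the shuffle moves data over the network; the contract is the same.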
Big Data in Motion - Online Ecosystem
[Diagram: an events stream feeds a complex event processing engine (correlations, filtering, aggregations, pattern matching) backed by an in-memory data store, connected via a message bus to offline systems, a decision service, and monitoring.]
CEP enables continuous analytics on data in motion:
• a solution for the velocity of big data
• well suited for detection, decisioning, alerting, and taking actions
• relies on an in-memory data grid to provide low latency
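The CEP pattern above (filter, aggregate, match over a short in-memory window) can be sketched in a few lines. The window length, threshold, and event shape are assumptions for illustration; a production CEP engine would do this continuously, at scale, on an in-memory data grid.

```python
from collections import deque

class SlidingWindowCEP:
    """Minimal CEP sketch: keep events for a short time window in memory,
    filter by account, aggregate spend, and alert on a burst pattern."""
    def __init__(self, window_seconds, threshold):
        self.window = window_seconds
        self.threshold = threshold
        self.events = deque()  # (timestamp, account, amount)

    def on_event(self, ts, account, amount):
        self.events.append((ts, account, amount))
        # evict events outside the window (data for short windows only)
        while self.events and self.events[0][0] <= ts - self.window:
            self.events.popleft()
        # filter + aggregate: total spend for this account in the window
        total = sum(a for t, acct, a in self.events if acct == account)
        # pattern match: spending burst above threshold fires an alert
        return "alert" if total > self.threshold else "ok"
```

Each incoming event is processed incrementally; nothing outside the window is retained, which is what keeps latency low.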
Big Data Movement
Data movement between offline and online is the key and biggest challenge:
• ETL jobs require custom coding - the biggest bottleneck
• data transfer is very expensive and slow across networks and multiple data centers
• online data stores are not optimized for parallel or bulk loads
• slows down the data store during ETL operations
• negatively impacts online application availability
Big Data Movement Evolution
Initial state
• 500 GB in 16 hours
Optimization - Phase 1 (two-tier architecture: offline batch into an in-memory data store)
• 2 TB in 16 hours
• split data files prepared offline
• maximize data-load parallelism
• maximum data compression
• optimized data format
• validation before data movement
Scale - Phase 2 (multi-tier architecture: offline batch into a persistent NoSQL backing store behind the in-memory data store)
• 10 TB in 6 hours
• add a persistent NoSQL store behind the in-memory store
• blast bulk loads into the NoSQL store
• batch process warms the cache
• lazy warm-up as needed, while serving reads/writes
• refresh cache contents via time-based evictions
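The Phase 2 layout can be sketched as a two-level store. This is a minimal sketch under stated assumptions: a plain dict stands in for the persistent NoSQL store, bulk loads bypass the cache, and the cache warms lazily on reads with time-based eviction, as in the bullets above.

```python
import time

class TieredStore:
    """In-memory cache with time-based eviction in front of a persistent
    backing store (a dict stands in for the NoSQL tier)."""
    def __init__(self, backing, ttl_seconds, clock=time.time):
        self.backing = backing
        self.ttl = ttl_seconds
        self.clock = clock          # injectable for testing
        self.cache = {}             # key -> (value, loaded_at)

    def bulk_load(self, records):
        """Blast the bulk load into the backing store, bypassing the cache;
        online reads are never blocked by the load."""
        self.backing.update(records)

    def get(self, key):
        hit = self.cache.get(key)
        if hit is not None and self.clock() - hit[1] < self.ttl:
            return hit[0]           # fresh cache hit
        value = self.backing.get(key)  # lazy warm-up on miss or expiry
        if value is not None:
            self.cache[key] = (value, self.clock())
        return value
```

Time-based eviction means a bulk refresh in the backing store becomes visible once cached entries expire, without any invalidation traffic to the cache tier.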
Confidential and Proprietary
Use case: Graph-based decisioning
[Diagram: an offline Map/Reduce graph builder pushes daily incremental updates into an online in-memory graph store; the events stream drives continuous graph updates and rollups; the online graph server feeds the decision service. Avg. read time: 2 ms; 95th percentile: 6 ms.]
• generate the graph and associated complex variables on Hadoop on a daily basis
• move the incremental changes to the online in-memory graph store
• based on the event stream, keep the graph and offline variables up to date
• the in-memory store provides fast, read-only access to decision services
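A minimal sketch of this use case, with my own (hypothetical) entity names and query: an in-memory adjacency graph is bulk-built from the offline tier, patched incrementally from the event stream, and answers fast linkage reads for the decision service (e.g. "are these two accounts connected via a shared device within a couple of hops?").

```python
from collections import defaultdict

class InMemoryGraph:
    """Online in-memory graph store: built offline in bulk, kept fresh by
    incremental updates from the event stream."""
    def __init__(self):
        self.adj = defaultdict(set)

    def bulk_build(self, edges):
        """Daily offline build (stand-in for the Map/Reduce graph builder)."""
        self.adj.clear()
        for a, b in edges:
            self.add_edge(a, b)

    def add_edge(self, a, b):
        """Incremental update driven by a single event."""
        self.adj[a].add(b)
        self.adj[b].add(a)

    def linked(self, a, b, max_hops=2):
        """Fast read for the decision service: are two entities connected
        within a few hops (e.g. via a shared device)?"""
        frontier, seen = {a}, {a}
        for _ in range(max_hops):
            frontier = {n for v in frontier for n in self.adj[v]} - seen
            if b in frontier:
                return True
            seen |= frontier
        return False
```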
Conclusion
• Hadoop is best for offline processing of data variety and volume - not for real time
• CEP is a solution for online big data in motion (velocity) and complements Hadoop
• harness the true power of big data by combining offline and online data
• data integration is key - careful planning and optimization are needed
• online data stores are not optimized for highly parallel writes or bulk loads
• big data can solve complex problems while delivering speed and accuracy